Marijuana-Related Crime (Denver)

Authors

Niloufar

Sergey

Nassim

Published

December 1, 2022

Abstract
The principal question of this project was to determine the key features and factors of marijuana-related (MJ-related) crimes committed in Denver, distinguishing crimes tied to the licensed industry from non-industry crimes. The data were taken from the Denver Police Department and cover the period from 2015 to 2020. The project highlights several geographical units where certain kinds of crimes are more likely to be committed, and suggests that MJ-related delinquency in Denver is criminologically closer to property crime and should be combatted accordingly. The most important outcome, however, was that no single aspect of a crime (locale, type or time) is sufficient to predict the remaining aspects well, at least not the industry/non-industry nature of the crime.

Introduction

Questions

For this project, we want to answer the following questions about marijuana-related crimes in Denver, Colorado.

  1. Geographical issues. How much does location (district, neighbourhood) contribute to crime features such as offense_type and offense_category? Most interestingly, we can put the crimes on the city map and ask, for example, whether proximity to the airport is associated with cultivation, and whether people grow weed in their flats or rather in suburban houses. With this question we want to investigate whether there is any correlation between the type of crime and its location.

  2. Criminology issues. What is the relation between marijuana and other types of crimes (e.g., are they mostly against property, or rather violent)?

  3. Machine learning problems: classify whether a certain crime is of the industry or non-industry type.

With what data?

For this project, we selected a data set from Kaggle called Marijuana related Crime. The data frame has 14 variables (6 numeric and 8 character) and 1201 observations. Only a few columns contain NA values.

The crimes included in the dataset are related to marijuana in some manner, either directly or indirectly. For instance, there may be a crime in which marijuana was sold or illegally possessed, as well as a crime in which the offender was simply a marijuana consumer stealing something in an otherwise ordinary way. Crimes in which marijuana itself was the crime target (e.g., it was stolen or extorted, or its cultivation was illegally infringed) are NOT included in the data.

Marijuana is legal in the state of Colorado, where Denver is located. So when a crime is labeled 'Marijuana possession', this means that the amount of marijuana exceeded the legal limits, its quality did not comply with the law (the THC content was too high), or the person was under the legal age for possession.

All the columns are more or less self-explanatory except for MJ_RELATION_TYPE, which takes the values 'INDUSTRY' or 'NON-INDUSTRY'. Industry-related crimes involve marijuana and licensed marijuana facilities: these are reported crimes committed against the licensed industry or by the industry itself. Non-industry crimes are reported crimes where marijuana is the primary target in the commission of the crime, but the marijuana has no readily apparent tie to a licensed operation.

At the same time, all of the crimes are in some way related to marijuana. This means no comparison with non-MJ-related crimes can be presented unless extra data are pulled in.

The data set is available at https://www.kaggle.com/datasets/jinbonnie/crime-and-weed

Where does the data come from, how was it collected?

Data in this file are crimes reported to the Denver Police Department which, upon review, were determined to have clear connection or relation to marijuana. These data do not include police reports for violations restricting the possession, sale, and/or cultivation of marijuana. This dataset is based upon the National Incident Based Reporting System (NIBRS) which includes all victims of person crimes and all crimes within an incident. The data is dynamic, which allows for additions, deletions and/or modifications at any time, resulting in more accurate information in the database. Due to continuous data entry, the number of records in subsequent extractions are subject to change.

To start the project, we first import the packages and our data set.

Code
import warnings
warnings.filterwarnings('ignore') 
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
import statsmodels.api as sm

rawdf = pd.read_csv("crime_marijuana.csv")
rawdf.head()
INCIDENT_ID FIRST_OCCURENCE_DATE LAST_OCCURENCE_DATE REPORTDATE INCIDENT_ADDRESS GEO_X GEO_Y DISTRICT_ID PRECINCT_ID OFFENSE_CODE OFFENSE_TYPE_ID OFFENSE_CATEGORY_ID MJ_RELATION_TYPE NEIGHBORHOOD_ID
0 2017151765 06-MAR-17 06-MAR-17 06-MAR-17 2207 N HOOKER ST 3132526.0 1698468.0 1 121 3563 DRUG - MARIJUANA CULTIVATION Drug Offenses NON-INDUSTRY\r sloan-lake
1 20184912 03-JAN-18 03-JAN-18 03-JAN-18 4400 E EVANS AVE 3158749.0 1672408.0 3 314 2205 BURGLARY - BUSINESS NO FORCE Burglary INDUSTRY\r university-hills
2 20184942 03-JAN-18 03-JAN-18 03-JAN-18 3435 S YOSEMITE ST 3173094.0 1663993.0 3 323 2203 BURGLARY - BUSINESS BY FORCE Burglary INDUSTRY\r hampden
3 201666719 01-FEB-16 01-FEB-16 01-FEB-16 5050 N YORK ST 3152054.0 1712498.0 2 212 2203 BURGLARY - BUSINESS BY FORCE Burglary INDUSTRY\r elyria-swansea
4 2016317585 21-MAY-16 21-MAY-16 21-MAY-16 3888 E MEXICO AVE 3157004.0 1674967.0 3 312 2203 BURGLARY - BUSINESS BY FORCE Burglary INDUSTRY\r cory-merrill
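
Note that the MJ_RELATION_TYPE values in the output above end with a stray carriage return (`\r`), an artifact of the file's line endings. A minimal sketch of how it could be stripped, shown on a hypothetical sample rather than the full file:

```python
import pandas as pd

# Hypothetical sample mimicking the raw column, with trailing '\r'
mj = pd.Series(["NON-INDUSTRY\r", "INDUSTRY\r", "INDUSTRY\r"])
mj_clean = mj.str.strip()       # removes surrounding whitespace, incl. '\r'
print(list(mj_clean.unique()))  # ['NON-INDUSTRY', 'INDUSTRY']
```

On the real frame, the same `str.strip()` call could be applied to every string column before analysis.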

Analysis

Cleaning the dataset

In order to do the exploratory data analysis (EDA), we check the variable types and the missing values. Our data set has 6 numerical and 8 string variables; 3 of the string variables should be converted to a date type. There is one missing value in GEO_X, one in GEO_Y and 338 in LAST_OCCURENCE_DATE, so we need to decide how to handle the latter. Based on the data set documentation, LAST_OCCURENCE_DATE is essentially the same as the date the crime happened. As argued below next to the corresponding code, the non-matching values are more likely to be simple errors than a genuine lack of information, so we replace the missing values with FIRST_OCCURENCE_DATE.

After filling the NaN values of LAST_OCCURENCE_DATE, we drop the rows where we have missing value in ‘GEO_X’ and ‘GEO_Y’.

Code
rawdf.info()
print(rawdf.isnull().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1201 entries, 0 to 1200
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   INCIDENT_ID           1201 non-null   int64  
 1   FIRST_OCCURENCE_DATE  1201 non-null   object 
 2   LAST_OCCURENCE_DATE   863 non-null    object 
 3   REPORTDATE            1201 non-null   object 
 4   INCIDENT_ADDRESS      1201 non-null   object 
 5   GEO_X                 1200 non-null   float64
 6   GEO_Y                 1200 non-null   float64
 7   DISTRICT_ID           1201 non-null   int64  
 8   PRECINCT_ID           1201 non-null   int64  
 9   OFFENSE_CODE          1201 non-null   int64  
 10  OFFENSE_TYPE_ID       1201 non-null   object 
 11  OFFENSE_CATEGORY_ID   1201 non-null   object 
 12  MJ_RELATION_TYPE      1201 non-null   object 
 13  NEIGHBORHOOD_ID       1201 non-null   object 
dtypes: float64(2), int64(4), object(8)
memory usage: 131.5+ KB
INCIDENT_ID               0
FIRST_OCCURENCE_DATE      0
LAST_OCCURENCE_DATE     338
REPORTDATE                0
INCIDENT_ADDRESS          0
GEO_X                     1
GEO_Y                     1
DISTRICT_ID               0
PRECINCT_ID               0
OFFENSE_CODE              0
OFFENSE_TYPE_ID           0
OFFENSE_CATEGORY_ID       0
MJ_RELATION_TYPE          0
NEIGHBORHOOD_ID           0
dtype: int64
We have one missing value in GEO_X, one in GEO_Y and 338 missing values in LAST_OCCURENCE_DATE, so we need to decide how to treat this variable. According to the data set documentation, LAST_OCCURENCE_DATE is essentially the same as the date the crime happened. But why should we believe that is true?
Code
mismatch = rawdf['FIRST_OCCURENCE_DATE'] != rawdf['LAST_OCCURENCE_DATE']
print(rawdf.loc[mismatch, 'LAST_OCCURENCE_DATE'].count())
print(rawdf.loc[mismatch, 'OFFENSE_TYPE_ID'].value_counts())
also_not_report = mismatch & (rawdf['LAST_OCCURENCE_DATE'] != rawdf['REPORTDATE'])
print(rawdf.loc[also_not_report].dropna(axis=0).count())
127
BURGLARY - BUSINESS BY FORCE      82
THEFT - OTHER                     54
CRIMINAL MISCHIEF - OTHER         42
ROBBERY - STREET                  36
DRUG - MARIJUANA SELL             24
THEFT - SHOPLIFT                  21
ASSAULT - SIMPLE                  18
CRIMINAL TRESPASSING              17
PUBLIC ORDER CRIMES - OTHER       17
AGGRAVATED ASSAULT                14
ROBBERY - BUSINESS                13
THREATS TO INJURE                 13
THEFT - FROM BLDG                 12
THEFT - ITEMS FROM VEHICLE        11
ROBBERY - RESIDENCE                9
DRUG - MARIJUANA CULTIVATION       7
BURGLARY - RESIDENCE BY FORCE      6
BURGLARY - RESIDENCE NO FORCE      5
BURGLARY - BUSINESS NO FORCE       5
MENACING - FELONY W/WEAP           5
DISTURBING THE PEACE               4
THEFT - OF MOTOR VEHICLE           4
CRIMINAL MISCHIEF - GRAFFITI       3
DRUG - MARIJUANA POSSESS           3
ROBBERY - CAR JACKING              2
DRUG - COCAINE POSSESS             2
DRUG - PCS - OTHER DRUG            2
ROBBERY - PURSE SNATCH W/FORCE     2
FRAUD - CRIMINAL IMPERSONATION     2
THEFT - UNAUTH USE OF FTD          2
BURGLARY - POSS. OF TOOLS          2
FORGERY - OTHER                    2
WEAPON-BY PREV OFFENDER-POWPO      1
FORGERY - POSSES FORGE DEVICE      1
ARSON - BUSINESS                   1
OTHER ENVIRONMENT/ANIMAL VIOL      1
ASSAULT - DV                       1
THEFT - PARTS FROM VEHICLE         1
FORGERY - POSS. OF FORGED FTD      1
ARSON - RESIDENCE                  1
POLICE - RESISTING ARREST          1
BURGLARY - VENDING MACHINE         1
WEAPON - FIRE INTO OCC BLDG        1
DRUG - METHAMPETAMINE POSSESS      1
POLICE - FALSE INFORMATION         1
THEFT - EMBEZZLE                   1
THEFT - PURSE SNATCH NO FORCE      1
KIDNAP - ADULT VICTIM              1
EXPLOSIVE/INCENDIARY DEV - POS     1
LIQUOR - POSSESSION                1
FORGERY - POSS OF FORGED INST      1
WEAPON- UNLAWFUL DISCHARGE OF      1
CRIMINAL MISCHIEF - MTR VEH        1
WEAPON-POSS ILLEGAL/DANGEROUS      1
THEFT - PICK POCKET                1
DRUG - HEROIN POSSESS              1
Name: OFFENSE_TYPE_ID, dtype: int64
INCIDENT_ID             30
FIRST_OCCURENCE_DATE    30
LAST_OCCURENCE_DATE     30
REPORTDATE              30
INCIDENT_ADDRESS        30
GEO_X                   30
GEO_Y                   30
DISTRICT_ID             30
PRECINCT_ID             30
OFFENSE_CODE            30
OFFENSE_TYPE_ID         30
OFFENSE_CATEGORY_ID     30
MJ_RELATION_TYPE        30
NEIGHBORHOOD_ID         30
dtype: int64
So we see that LAST_OCCURENCE_DATE differs from FIRST_OCCURENCE_DATE not only when it is missing, but in 127 more cases. What might the nature of these mismatches be? As area-specific knowledge suggests, the crime types where the two dates differ are mostly not continuous by nature (e.g., burglaries, robberies, thefts). Moreover, in only 30 cases does LAST_OCCURENCE_DATE differ from both FIRST_OCCURENCE_DATE and REPORTDATE at the same time. This means that the non-matching values are more likely to be simple errors than a genuine lack of information. At least the remainder, where the mismatch may indeed be informative, is small and cannot be extrapolated from. Hence we can replace the missing values with FIRST_OCCURENCE_DATE.
Code
df = rawdf.copy()
df['LAST_OCCURENCE_DATE'] = df['LAST_OCCURENCE_DATE'].fillna(df['FIRST_OCCURENCE_DATE'])
print(df.isnull().sum())
INCIDENT_ID             0
FIRST_OCCURENCE_DATE    0
LAST_OCCURENCE_DATE     0
REPORTDATE              0
INCIDENT_ADDRESS        0
GEO_X                   1
GEO_Y                   1
DISTRICT_ID             0
PRECINCT_ID             0
OFFENSE_CODE            0
OFFENSE_TYPE_ID         0
OFFENSE_CATEGORY_ID     0
MJ_RELATION_TYPE        0
NEIGHBORHOOD_ID         0
dtype: int64

Now we drop the rows with missing values in GEO_X and GEO_Y.

Code
df = df.dropna()
print(df.isnull().sum())
df.shape
INCIDENT_ID             0
FIRST_OCCURENCE_DATE    0
LAST_OCCURENCE_DATE     0
REPORTDATE              0
INCIDENT_ADDRESS        0
GEO_X                   0
GEO_Y                   0
DISTRICT_ID             0
PRECINCT_ID             0
OFFENSE_CODE            0
OFFENSE_TYPE_ID         0
OFFENSE_CATEGORY_ID     0
MJ_RELATION_TYPE        0
NEIGHBORHOOD_ID         0
dtype: int64
(1200, 14)

Working with time series

To work with time series, we need to convert the date variables into a date type. As mentioned earlier, we have 3 date variables, but they are stored as strings.
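
The raw dates are strings like '06-MAR-17'. As a quick illustration (a sketch on sample values, not the full column), `pd.to_datetime` can parse them, and passing an explicit format string makes the conversion unambiguous when the layout is known to be fixed:

```python
import pandas as pd

sample = pd.Series(["06-MAR-17", "03-JAN-18"])
# %d = day, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(sample, format="%d-%b-%y")
print(parsed.dt.year.tolist())  # [2017, 2018]
```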

After changing the type, we add 5 new variables to our data set and visualise the yearly trend in the number of crimes. Note that for 2020 we only have 19 records, all in January.

Code
# The raw dates are strings like '06-MAR-17'; convert them directly
FOD_series = pd.to_datetime(df['FIRST_OCCURENCE_DATE'])
LOD_series = pd.to_datetime(df['LAST_OCCURENCE_DATE'])
RD_series = pd.to_datetime(df['REPORTDATE'])
df.drop(['FIRST_OCCURENCE_DATE' , 'LAST_OCCURENCE_DATE','REPORTDATE'], axis = 1, inplace = True)
df.insert(loc=0, column ='first_occurence_date', value= FOD_series)
df.insert(loc=1, column ='last_occurence_date', value= LOD_series)
df.insert(loc=2, column = 'reported_date' , value = RD_series)

year_series = FOD_series.dt.year # Getting year values
month_series = FOD_series.dt.month # Getting month values
day_series = FOD_series.dt.day # Getting day values as integers
day_name_series = FOD_series.dt.day_name() # Getting the days of a week, i.e., Monday, Tuesday, Wednesday etc.

# Add the 'Year', 'Month', 'Day' and 'Day Name' columns to the DataFrame.
df['Year'] = year_series
df['Month'] = month_series
df['Day'] = day_series
df['Day Name'] = day_name_series

# Creating the duration variable
duration =(df['last_occurence_date']-df['first_occurence_date'])
duration = duration.apply(lambda x: x.days)
df.insert(loc=4, column='duration', value = duration)
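
To make the duration logic concrete, here is what it yields on a hypothetical pair of date columns (duration is the number of whole days between the first and last occurrence):

```python
import pandas as pd

first = pd.to_datetime(pd.Series(["2016-02-01", "2016-05-21"]))
last = pd.to_datetime(pd.Series(["2016-02-03", "2016-05-21"]))
duration = (last - first).dt.days  # timedelta -> integer days
print(duration.tolist())  # [2, 0]
```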

Visualisation on the map

One problem with this dataset is that it lacks longitude–latitude coordinates; instead we have state plane coordinates (SPC). Below we transform the coordinates and plot the crime locations on a map of Denver. We then create a new variable named dist, the distance of the occurrence place from the city centre.
Code
from pyproj import Transformer

# Transform state plane coordinates (EPSG:2232) to WGS84 latitude/longitude
transformer = Transformer.from_crs(2232, 4326)
df['lat'], df['long'] = transformer.transform(df['GEO_X'].values, df['GEO_Y'].values)

fig1. How the crimes are located relative to each other. More informative maps follow.

Code
fig,ax = plt.subplots()
ax.plot(df['long'],df['lat'],'r+')
plt.show()

fig2. We use the bounds from BBox to export a picture of the streets of Denver from OpenStreetMap; this is a fast and easy way to get an overview. We then import the picture and plot the data points on top of it.

Code
BBox = np.append(ax.get_xlim(),      
         ax.get_ylim())
print(BBox)
denmap = plt.imread('map.png')
plt.figure(figsize=(8, 8))
plt.imshow(denmap, zorder=0, extent = BBox, aspect= 'equal')
plt.plot(df.long,df.lat,'r+')
plt.show() 
[-105.12899582 -104.68699093   39.61785623   39.84705498]

The map above shows that the majority of crimes occur in the middle of the city, while some areas, such as the south-east and the north-west, have fewer crimes than other parts.

In addition, we define a new variable called 'dist': the distance of the crime from the city centre.

Code
from math import radians, cos, sin, sqrt, atan2
centerlat = 39.7392
centerlong = -104.9903
def calculate_spherical_distance(lat,long):
    # Convert degrees to radians
    coordinates = lat, long, centerlat, centerlong
    phi1, lambda1, phi2, lambda2 = [radians(c) for c in coordinates]
    # Apply the haversine formula
    a = sin((phi2-phi1)/2)**2 + cos(phi1) * cos(phi2) * sin((lambda2-lambda1)/2)**2
    d = 2*6371*atan2(sqrt(a),sqrt(1-a))
    return d
df['dist'] = df.apply(lambda x: calculate_spherical_distance(x['lat'], x['long']), axis=1)
df['dist'].describe()
count    1200.000000
mean        5.811882
std         3.471711
min         0.252880
25%         3.399797
50%         5.299129
75%         7.186625
max        26.514020
Name: dist, dtype: float64
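
As an aside, the same haversine computation can be vectorized with NumPy, which avoids the row-by-row `apply`. A sketch under the same assumptions as above (Denver centre at 39.7392, -104.9903 and an Earth radius of 6371 km):

```python
import numpy as np

def haversine_km(lat, lon, lat0=39.7392, lon0=-104.9903, r=6371.0):
    """Great-circle distance (km) from (lat0, lon0), vectorized over arrays."""
    phi1, lam1, phi2, lam2 = map(np.radians, (lat, lon, lat0, lon0))
    a = np.sin((phi2 - phi1) / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin((lam2 - lam1) / 2) ** 2
    return 2 * r * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# A point roughly 1 km east of the centre
d = haversine_km(np.array([39.7392]), np.array([-104.9786]))
print(round(float(d[0]), 1))  # 1.0
```

On the real frame this would be `df['dist'] = haversine_km(df['lat'].values, df['long'].values)`.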

Adding an additional label column

In forensic sciences, a major distinction is made between violent and non-violent crimes. Let us label the offences as violent or not.

Code
def labeller(string):
  violent = 'violent'
  nonviolent = 'non-violent'
  vcrimes = ['ASLT', 'ASSAULT', 'ROBBERY', 'THREATS TO INJURE', 'BY FORCE', 'KIDNAP', 'ARSON', 'W/ FORCE', 'MENACING', 'CAR JACKING']
  label = nonviolent
  for item in vcrimes:
    if item in string:
      label = violent
  if 'NO FORCE' in string:
    label = nonviolent
  return label
df['VIOLENCE_RELATION']=df['OFFENSE_TYPE_ID'].apply(labeller)
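
A quick sanity check of the labelling rule on a few offence strings from the dataset (the function is restated compactly here so the snippet runs on its own):

```python
def labeller(s):
    vcrimes = ['ASLT', 'ASSAULT', 'ROBBERY', 'THREATS TO INJURE', 'BY FORCE',
               'KIDNAP', 'ARSON', 'W/ FORCE', 'MENACING', 'CAR JACKING']
    label = 'violent' if any(v in s for v in vcrimes) else 'non-violent'
    if 'NO FORCE' in s:  # the explicit NO FORCE marker overrides a keyword match
        label = 'non-violent'
    return label

print(labeller('BURGLARY - BUSINESS BY FORCE'))   # violent
print(labeller('BURGLARY - BUSINESS NO FORCE'))   # non-violent
print(labeller('DRUG - MARIJUANA CULTIVATION'))   # non-violent
```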

Visualisation

Geographical issues

To begin with, let us look at the offence categories on the map.

Code
sns.lmplot( x='GEO_X', y='GEO_Y', data=df, fit_reg = False, hue='OFFENSE_CATEGORY_ID', palette="Set1", legend=False)
plt.legend(bbox_to_anchor=(1.7, 1), loc='upper right')
plt.show()

fig3. The categories seem to be distributed more or less evenly, without any apparent tendency.

Next, let us see the MJ relation type on the map:

Code
sns.scatterplot( x='GEO_X', y='GEO_Y', data=df, size='MJ_RELATION_TYPE',sizes=(10,100), alpha = 0.2)
plt.legend(['Non-Industry', 'Industry'], loc='lower right')
plt.show()

fig4. Similar to the previous graph, there is no apparent tendency here. However, we can see that for GEO_X above roughly 3.19 million and GEO_Y above roughly 1.72 million, no industry crimes occurred.

Finally, observe violent and non-violent crimes on the map

Code
sns.lmplot( x='GEO_X', y='GEO_Y', data=df, fit_reg = False, hue='VIOLENCE_RELATION', palette="Set1", legend=False)
plt.legend(loc='lower right')
plt.show()

fig5. The map shows that the majority of both violent and non-violent crimes were committed between roughly 3.14 and 3.16 million in GEO_X, but they are spread over a wider range of GEO_Y (roughly 1.67 to 1.71 million).

Visualization of crime categories

To begin with, let us observe the distribution of offence categories.

Code
ax = sns.countplot(y=df['OFFENSE_CATEGORY_ID'], order = df['OFFENSE_CATEGORY_ID'].value_counts().index, orient ='h')
plt.show()

fig6. At first glance, we can see that most of the crimes are burglary offences (almost 700), while Larceny, Robbery-Street-Res and Criminal Mischief-Property number about 150, 100 and 90 respectively.

Apart from offence categories, we also have offence types, which are more specific and consequently more numerous. We will not plot them, because there are too many and the vast majority have very few observations.

Code
ax = sns.countplot(data = df, x=df.MJ_RELATION_TYPE)
ax.set_xticks(range(0, len(df.MJ_RELATION_TYPE.unique())))
ax.set_xticklabels(labels = ['Non-Industry', 'Industry'])
plt.show()

fig7. Based on the bar chart above, the number of industry MJ-crimes is about four times the number of non-industry ones.

Code
ax = sns.countplot(data = df, x=df['VIOLENCE_RELATION'])
ax.set_xticks(range(0, len(df['VIOLENCE_RELATION'].unique())))
ax.set_xticklabels(labels = df['VIOLENCE_RELATION'].unique())
plt.show()

fig8. We can see that the number of violent crimes is about twice the number of non-violent crimes.

Trend of MJ crime over time

Another important topic to consider is the trend of the main variables over time. To do that, we first define some time-related variables and then visualise the yearly trend of crimes.

Code
year_styles = {2015: 'g-o', 2016: 'b-s', 2017: 'k-^', 2018: 'r--+', 2019: 'y-o'}

plt.figure(figsize=(15, 5))
plt.title('Yearly Trend of number of Crimes per Month')
# One line per year: monthly counts of crimes
for year, style in year_styles.items():
    counts = df.loc[df['Year'] == year].groupby('Month').size()
    plt.plot(counts.index, counts.values, style, label=str(year))
plt.xlabel('Month')
plt.ylabel('Number of Crimes')
plt.legend(bbox_to_anchor=(1, 1), loc='upper right')
plt.grid(True)
plt.show()

fig9. At first glance there is clear within-year fluctuation in each year, yet in some months a common pattern can be observed. For instance, in August there is a sudden increase in almost all years. In addition, in every year except 2018 a marked drop in the number of crimes can be seen from September to October or from November to December.

Trend of violent and non-violent crime over time

For 2020 we only have data for January, which is why the counts are lower than in other years. The plot shows that in each year the number of violent crimes exceeds the number of non-violent ones.

Code
valuetable = pd.crosstab(df[df.Year != 2020]['Year'], df['VIOLENCE_RELATION'])
valuetable.plot.bar(stacked=True, figsize=(5, 5))
plt.title('Yearly Trend')
plt.xticks(rotation='horizontal')
plt.xlabel('Year')
plt.ylabel('count')
plt.legend(bbox_to_anchor=(1, 1), loc='upper right', labels=['Non-violent', 'Violent'])
plt.show()

fig10. The stacked bar plot above confirms that the number of violent crimes exceeds the number of non-violent ones. The proportion of violent to non-violent crimes also appears roughly constant across years.
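
The claim about proportions can be checked directly with a normalized crosstab. A sketch on toy data (with the real frame one would pass `df['Year']` and `df['VIOLENCE_RELATION']`):

```python
import pandas as pd

years = pd.Series([2015, 2015, 2016, 2016, 2016, 2016])
viol = pd.Series(['violent', 'non-violent'] * 3)
# normalize='index' turns each row (year) into proportions summing to 1
shares = pd.crosstab(years, viol, normalize='index')
print(shares.loc[2015, 'violent'])  # 0.5
```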

Code
Monthlabels = ['January','February','March','April','May','June','July','August', 'September', 'October', 'November', 'December']
Monthtickorder = list(list(df['Month'].unique())[i]-1 for i in range(0, 12))
Monthtickorder.sort()
valuetable = pd.crosstab(df[df.Year != 2020]['Month'], df['VIOLENCE_RELATION'])
valuetable.plot.bar(stacked=True, figsize=(5, 5))
plt.title('Monthly Trend')
plt.xticks(ticks=Monthtickorder, labels=Monthlabels)
plt.xlabel('Month')
plt.ylabel('count')
plt.legend(bbox_to_anchor=(1.01, 1), loc='upper right', labels=['Non-violent', 'Violent'])
plt.show()

fig11. The monthly trend shows that from June to September the numbers of both violent and non-violent crimes increased steadily, while in the other months the trend fluctuated. The violent/non-violent proportion does differ across months.
Code
Daylabels = ['Monday', 'Tuesday','Wednesday', 'Thursday', 'Friday','Saturday', 'Sunday']
valuetable = pd.crosstab(df['Day Name'],df['VIOLENCE_RELATION']).reset_index()
valuetable.set_index('Day Name', inplace = True)
valuetable = valuetable.reindex(Daylabels).reset_index()
valuetable.plot.bar(stacked=True)
plt.title('Daily Trend')
plt.xlabel('Day of Week')
plt.ylabel('count')
plt.legend(bbox_to_anchor=(1.01, 1), loc='upper right', labels=['Non-violent', 'Violent'])
plt.xticks(ticks=range(0, 7), labels=Daylabels, rotation=0, fontsize=9)
plt.show()

fig12. The number of violent crimes from Tuesday to Thursday is higher than on other days. At the same time, the proportions seem to differ between working days and the weekend.

Trend of industry and non-industry crimes over time
Code
valuetable = pd.crosstab(df[df.Year != 2020]['Year'], df['MJ_RELATION_TYPE'])
valuetable.plot.bar(stacked=True, figsize=(5, 5), color=['#ff7f0e', '#1f77b4'])
plt.title('Yearly Trend')
plt.xticks(rotation='horizontal')
plt.xlabel('Year')
plt.ylabel('count')
order = [0,1]
handles, labels = plt.gca().get_legend_handles_labels()
plt.legend(handles = [handles[i-1] for i in order], bbox_to_anchor=(1, 1), loc='upper right', labels=['Non-Industry', 'Industry'])
plt.show()

fig13. We can observe that the number of non-industry crimes decreased gradually; their proportion was decreasing throughout the whole period.

Code
Monthlabels = ['January','February','March','April','May','June','July','August', 'September', 'October', 'November', 'December']
Monthtickorder = list(list(df['Month'].unique())[i]-1 for i in range(0, 12))
Monthtickorder.sort()
valuetable = pd.crosstab(df[df.Year != 2020]['Month'], df.MJ_RELATION_TYPE)
valuetable.plot.bar(stacked=True, figsize=(5, 5), color=['#ff7f0e', '#1f77b4'])
plt.title('Monthly Trend')
plt.xticks(ticks=Monthtickorder, labels=Monthlabels)
plt.xlabel('Month')
plt.ylabel('count')
plt.legend(handles = [handles[i-1] for i in order], bbox_to_anchor=(1, 1), loc='upper left', labels=['Non-Industry', 'Industry'])
plt.show()

fig14. As with violent and non-violent crimes, the proportion differs greatly across months; compare, for instance, October and December.

Code
valuetable = pd.crosstab(df['Day Name'],df['MJ_RELATION_TYPE']).reset_index()
valuetable.set_index('Day Name', inplace = True)
valuetable = valuetable.reindex(Daylabels).reset_index()
valuetable.plot.bar(stacked=True, color = ['#ff7f0e', '#1f77b4'])
plt.title('Daily Trend')
plt.xlabel('Day of Week')
plt.ylabel('count')
plt.legend(handles = [handles[i-1] for i in order], bbox_to_anchor=(1.01, 1), loc='upper right', labels=['Non-Industry', 'Industry'])
plt.xticks(ticks=range(0, 7), labels=Daylabels, rotation=0, fontsize=9)
plt.show()

fig15. The split between the weekend and the rest of the week is not as obvious as it was in the corresponding violent/non-violent bar chart.

Plots concerning offence categories

We will be using categories (not types), as there are far fewer categories than types, so conclusions can be more general. For the next plot, note that there are only four principal crime categories (burglary, larceny, street robbery and property mischief); the remaining categories have few observations, and trends based on so few points are not very informative.

Code
df_burglary = df.loc[(df['OFFENSE_CATEGORY_ID'] == 'Burglary') & (df['Year'] != 2020)]
df_larceny = df.loc[(df['OFFENSE_CATEGORY_ID'] == 'Larceny') & (df['Year'] != 2020)]
df_rsr = df.loc[(df['OFFENSE_CATEGORY_ID'] == 'Robbery-Street-Res') & (df['Year'] != 2020)]
df_mischief = df.loc[(df['OFFENSE_CATEGORY_ID'] == 'Criminal Mischief-Property') & (df['Year'] != 2020)]
Yearlabels = list(df_larceny.Year.unique())
Yearlabels.sort()

fig, (ax2, ax1) = plt.subplots(2, 1, sharex=True, figsize=(15, 5))

ax2.set_title('Trend of Number of Most Numerous Crime Types through Years')
ax2.set_ylim(85,200)
ax2.xaxis.set_ticks_position('none') 
ax2.spines.bottom.set_visible(False)

ax2.plot(df_burglary['Year'].sort_values().unique().reshape(-1,1), df_burglary.groupby('Year',as_index=False).count()['OFFENSE_CATEGORY_ID'], 'g-o',label='Burglary')
ax1.plot(df_larceny['Year'].sort_values().unique().reshape(-1,1), df_larceny.groupby('Year',as_index=False).count()['OFFENSE_CATEGORY_ID'], 'b-s',label='Larceny')
ax1.plot(df_rsr['Year'].sort_values().unique().reshape(-1,1), df_rsr.groupby('Year',as_index=False).count()['OFFENSE_CATEGORY_ID'], 'k-^',label='Robbery-Street-Res')
ax1.plot(df_mischief['Year'].sort_values().unique().reshape(-1,1), df_mischief.groupby('Year',as_index=False).count()['OFFENSE_CATEGORY_ID'], 'r--+',label='Criminal Mischief-Property')
ax1.set_xlabel('Year')
ax1.set_xticks(ticks=Yearlabels)
ax1.set_xticklabels(labels=Yearlabels)
ax1.set_ylim(ymax=35)
ax1.set_ylabel('Number of Crimes')
ax1.spines.top.set_visible(False)
ax1.xaxis.tick_bottom()

d = .5
kwargs = dict(marker=[(-1, -d), (1, d)], markersize=12,
              linestyle="none", color='k', mec='k', mew=1, clip_on=False)
ax2.plot([0, 1], [0, 0], transform=ax2.transAxes, **kwargs)
ax1.plot([0, 1], [1, 1], transform=ax1.transAxes, **kwargs)

fig.subplots_adjust(top=0.9)
fig.legend(bbox_to_anchor=(1, 1), loc='upper right')
plt.show()

fig16. In 2016 Denver experienced a marked rise in the number of burglaries, while over the same period the other categories decreased or remained stable. In addition, the counts of all categories except Robbery-Street-Res rose again by 2018, after which a decreasing trend can be observed in every category.

Code
ax = sns.histplot(data = df, hue='Month', y = 'OFFENSE_CATEGORY_ID', palette = 'tab10', alpha=1)
ax.legend(labels=Monthlabels)
plt.show()

fig17. Note that the bars are intentionally not stacked, for the sake of clarity: the left-most section represents the least criminal month. We can see that different months (April, May, July, December) lead for different crime categories, which reflects the variety of crime types in the dataset.

For plotting over weekdays we again keep only the four most frequent crime categories; otherwise the plot is hard to read.

Code
filtered_df = df.loc[df['OFFENSE_CATEGORY_ID'].isin(['Burglary', 'Larceny', 'Criminal Mischief-Property', 'Robbery-Street-Res'])]
ax = sns.countplot(data = filtered_df , hue='Day Name', y = 'OFFENSE_CATEGORY_ID', hue_order = Daylabels, alpha=1, orient = 'v')
ax.legend(labels = Daylabels)
plt.show()

fig18. The majority of burglaries occurred on Tuesday and Thursday, whereas larcenies most often happened on Thursday, and for Robbery-Street-Res and Criminal Mischief-Property, Sunday has the most crimes. As with months, there is no single leader: different weekdays lead for different crime types.

Now let us turn the picture around and plot the proportions of crime types over time units. The graph over years turned out to be rather uninformative, so we go straight to months.

Code
no2020 = filtered_df[filtered_df['Year'] != 2020]
valuetable = pd.crosstab(no2020['Month'], no2020['OFFENSE_CATEGORY_ID'])
valuetable.plot.bar(stacked=True, figsize=(5, 5))  # plot.bar creates its own figure
plt.title('Monthly Trend')
plt.xlabel('Month')
plt.ylabel('count')
plt.xticks(ticks=Monthtickorder, labels=Monthlabels)
plt.legend(bbox_to_anchor=(1.5, 1), loc='upper right')
plt.show()

fig19. First we remind the reader that all years except 2020 were considered. Based on the plot, the proportion of burglaries increased in August, the share of larceny seems higher in January, June stood out for street robberies, and property mischief was more frequent in September.
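Since the stacked bars show raw counts, the monthly proportions discussed here can be read off more directly from a row-normalized crosstab. A minimal sketch on toy data (the column names mirror the ones used above, but the values are made up):

```python
import pandas as pd

# Toy frame mimicking the filtered dataset: one row per crime record
toy = pd.DataFrame({
    'Month': [1, 1, 1, 8, 8, 8, 8],
    'OFFENSE_CATEGORY_ID': ['Larceny', 'Larceny', 'Burglary',
                            'Burglary', 'Burglary', 'Burglary', 'Larceny'],
})

# normalize='index' turns each month's counts into shares that sum to 1
shares = pd.crosstab(toy['Month'], toy['OFFENSE_CATEGORY_ID'], normalize='index')
print(shares.round(2))
```

The same `normalize='index'` argument would work on `filtered_df` directly.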

Code
valuetable = pd.crosstab(filtered_df['Day Name'],filtered_df['OFFENSE_CATEGORY_ID']).reset_index()
valuetable.set_index('Day Name', inplace = True)
valuetable = valuetable.reindex(Daylabels).reset_index()
valuetable.plot.bar(stacked=True)
plt.title('Daily Trend')
plt.xlabel('Day of Week')
plt.ylabel('count')
plt.legend(bbox_to_anchor=(1.1, 1.15), loc='upper right', labels=['Burglary', 'Criminal Mischief-Property', 'Larceny', 'Robbery-Street-Res'])
plt.xticks(ticks=range(0, 7), labels=Daylabels, rotation=0, fontsize=9)
plt.show()

fig20. Similar to fig7, this chart shows the trend of the most numerous crimes over the weekdays. The main surprise here is the very distinctive proportion for Sunday, unlike any other weekday.

Relation between MJ relation type and distance

We finalize our visualization part with some extra plots concerning our target variable, MJ relation type. In particular, we wanted to visualize its possible links to the distance from the city center (the ‘dist’ variable).

Firstly, we plot the ‘dist’ variable itself to know its structure better.

Code
plt.figure(figsize=(10,3))
plt.subplot(121)
plt.hist(np.array(df['dist']) , density=True , bins=50, edgecolor='black' ,facecolor='pink', alpha=0.75)
plt.xlabel('Value', fontsize= 10)
plt.ylabel('Frequency', fontsize= 10)
plt.subplot(122)
sns.boxplot(y ='dist', data=df,palette="Blues")
plt.xlabel('dist')
plt.show()

fig21. The boxplot summarizes the position measures of the distribution. It shows that the median of ‘dist’ is around 5 km from the city center and that the outliers lie more than 5 times higher than the median. Moreover, we notice a narrow IQR, which means that the majority of crimes are committed between 3.5 and 6.5 km from the city center. The histogram confirms this conclusion.
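The quantities read off the boxplot can also be checked numerically. A minimal sketch, illustrated on synthetic data standing in for `df['dist']`:

```python
import numpy as np

rng = np.random.default_rng(0)
dist = rng.normal(5, 1.2, size=1000)   # synthetic stand-in for df['dist'] (km)

q1, med, q3 = np.percentile(dist, [25, 50, 75])
iqr = q3 - q1
# Tukey's rule: points beyond 1.5*IQR from the box are drawn as outliers
upper_fence = q3 + 1.5 * iqr
print(f"median={med:.2f}  IQR={iqr:.2f}  upper fence={upper_fence:.2f}")
```

Running the same three lines on the real `df['dist']` would reproduce the median and IQR described above.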

Secondly, we plot ‘dist’ against the target value:

Code
plt.figure(figsize=(6, 6))
sns.boxplot(x='MJ_RELATION_TYPE', y='dist', data=df)
plt.xticks(ticks = [0, 1], labels = ['Non-industry', 'Industry'])
plt.show()

fig22. As the chart shows, the variability of distance from the city center is greater for non-industry crimes than for industry crimes, although both have the same median (around 5 km). There are more outliers among non-industry crimes, lying approximately 17-27 km from the city center. The IQR of the industrial crimes is narrower, so we conclude that their locations are less spread out than those of non-industrial ones.

Analysis of correlation

Correlation matrices are composed for two purposes. Firstly, they may help identify interesting coincidences between the variables (which, however, does not imply any causation); secondly, they are useful for hunting down collinearity that may affect the ML models. However, after one-hot encoding the dataframe becomes very wide, so the correlation matrix is too large to visualize. Instead, we extract the most correlated and anti-correlated variable pairs as a dataframe.
Code
corrdf = df[['MJ_RELATION_TYPE', 'lat', 'long', 'DISTRICT_ID', 'OFFENSE_CATEGORY_ID',
'NEIGHBORHOOD_ID', 'Month', 'Day Name',  'VIOLENCE_RELATION', 'dist', 'duration']]
corrdf = pd.get_dummies(data=corrdf, drop_first=True, columns=['DISTRICT_ID', 'OFFENSE_CATEGORY_ID',
'NEIGHBORHOOD_ID', 'Month', 'Day Name',  'VIOLENCE_RELATION', 'MJ_RELATION_TYPE'])

def get_top_abs_correlations(df, n=10, geography_and_offense_only=False):
    '''Tune n to change the number of top correlations'''
    au_corr = df.corr(method='spearman').unstack()
    au_corr = au_corr.sort_values(ascending=False)
    au_corr = au_corr.to_frame().reset_index()
    au_corr = au_corr[au_corr[0]!=1]
    au_corr.drop_duplicates(subset=0, inplace = True)
    au_corr.rename(columns={'level_0': 'Value1', 'level_1': 'Value2', 0:'Correlation_coef'}, inplace=True)
    #filter insignificant correlations
    au_corr = au_corr[abs(au_corr.Correlation_coef)>0.1] 
    au_corrng = au_corr

    #filter the correlation df for pairs composed of ('NEIGHBORHOOD_ID','DISTRICT_ID', 'OFFENSE_CATEGORY_ID', 'lat', 'long')
    au_corr =  au_corr[au_corr.Value1.str.startswith(('NEIGHBORHOOD_ID','DISTRICT_ID', 'OFFENSE_CATEGORY_ID', 'lat', 'long', 'dist'))]
    au_corr =  au_corr[au_corr.Value2.str.startswith(('NEIGHBORHOOD_ID','DISTRICT_ID', 'OFFENSE_CATEGORY_ID', 'lat', 'long', 'dist'))]
    #get the part where the first column is geographical and the second is offense category
    au_corr1 = au_corr[au_corr.Value1.str.startswith(('NEIGHBORHOOD_ID','DISTRICT_ID','lat', 'long', 'dist'))]
    au_corr1 = au_corr1[au_corr1.Value2.str.startswith(('OFFENSE_CATEGORY_ID'))]
    #get the part where the first column is offense category and the second is geographical 
    au_corr2 =  au_corr[au_corr.Value1.str.startswith(('OFFENSE_CATEGORY_ID'))]
    au_corr2 = au_corr2[au_corr2.Value2.str.startswith(('NEIGHBORHOOD_ID','DISTRICT_ID','lat', 'long', 'dist'))]
    #concatenate the two parts
    au_corr = pd.concat([au_corr1, au_corr2])
    au_corr = au_corr.sort_values(by='Correlation_coef', ascending=False)
            
    if not geography_and_offense_only:
        au_corr =  au_corrng[au_corrng.Value1.str.startswith(('NEIGHBORHOOD_ID','DISTRICT_ID','lat', 'long', 'dist', 'Month', 'Day Name'))]
        au_corr =  au_corr[au_corr.Value2.str.startswith(('NEIGHBORHOOD_ID','DISTRICT_ID', 'lat', 'long', 'dist', 'Month', 'Day Name'))]
        print(au_corr)
        au_corr=au_corrng.merge(au_corr, on=['Value1', 'Value2', 'Correlation_coef'], how='left', indicator=True)
        au_corr=au_corr[au_corr._merge=='left_only'].drop(axis=1, labels='_merge')
        au_corr = pd.concat([au_corr.iloc[0:n, :], au_corr.iloc[len(au_corr)-1-n:len(au_corr)-1, :]], axis=0)
        
    return au_corr

get_top_abs_correlations(corrdf)
                               Value1                     Value2  \
117                     DISTRICT_ID_5  NEIGHBORHOOD_ID_montbello   
121    NEIGHBORHOOD_ID_elyria-swansea              DISTRICT_ID_2   
123                               lat              DISTRICT_ID_2   
125                     DISTRICT_ID_7        NEIGHBORHOOD_ID_dia   
127                              long                       dist   
...                               ...                        ...   
13673                             lat   NEIGHBORHOOD_ID_overland   
13679                             lat              DISTRICT_ID_4   
13681                   DISTRICT_ID_6                       dist   
13685                   DISTRICT_ID_4                       long   
13687                             lat              DISTRICT_ID_3   

       Correlation_coef  
117            0.812210  
121            0.547706  
123            0.511313  
125            0.499374  
127            0.467455  
...                 ...  
13673         -0.328414  
13679         -0.429106  
13681         -0.463937  
13685         -0.562981  
13687         -0.595252  

[199 rows x 3 columns]
Value1 Value2 Correlation_coef
1 VIOLENCE_RELATION_violent OFFENSE_CATEGORY_ID_Burglary 0.625153
8 OFFENSE_CATEGORY_ID_Robbery-Street-Res MJ_RELATION_TYPE_NON-INDUSTRY\r 0.442271
30 NEIGHBORHOOD_ID_hale OFFENSE_CATEGORY_ID_Robbery-Business 0.274946
45 DISTRICT_ID_6 MJ_RELATION_TYPE_NON-INDUSTRY\r 0.198805
47 DISTRICT_ID_6 OFFENSE_CATEGORY_ID_Agg ASLT-Other 0.191872
48 DISTRICT_ID_7 OFFENSE_CATEGORY_ID_Theft from Motor Vehicle 0.190352
49 OFFENSE_CATEGORY_ID_Robbery-Street-Res VIOLENCE_RELATION_violent 0.190183
63 MJ_RELATION_TYPE_NON-INDUSTRY\r NEIGHBORHOOD_ID_sloan-lake 0.161758
71 MJ_RELATION_TYPE_NON-INDUSTRY\r NEIGHBORHOOD_ID_gateway-green-valley-ranch 0.154407
72 OFFENSE_CATEGORY_ID_Criminal Mischief-Graffiti NEIGHBORHOOD_ID_speer 0.153741
241 DISTRICT_ID_6 OFFENSE_CATEGORY_ID_Burglary -0.213144
242 OFFENSE_CATEGORY_ID_Drug Offenses OFFENSE_CATEGORY_ID_Burglary -0.217856
245 OFFENSE_CATEGORY_ID_All Other Crimes VIOLENCE_RELATION_violent -0.231180
249 VIOLENCE_RELATION_violent OFFENSE_CATEGORY_ID_Drug Offenses -0.269096
251 OFFENSE_CATEGORY_ID_Burglary OFFENSE_CATEGORY_ID_Criminal Mischief-Property -0.272626
254 OFFENSE_CATEGORY_ID_Burglary MJ_RELATION_TYPE_NON-INDUSTRY\r -0.290677
255 OFFENSE_CATEGORY_ID_Burglary OFFENSE_CATEGORY_ID_Robbery-Street-Res -0.299994
256 OFFENSE_CATEGORY_ID_Burglary OFFENSE_CATEGORY_ID_All Other Crimes -0.310034
259 OFFENSE_CATEGORY_ID_Criminal Mischief-Property VIOLENCE_RELATION_violent -0.336748
260 OFFENSE_CATEGORY_ID_Larceny OFFENSE_CATEGORY_ID_Burglary -0.397286
We may highlight that the table shows evident correlations arising from crime classification: e.g. a strong positive correlation between being violent and being a burglary, with the opposite for larceny; between being a street robbery and being a non-industrial crime, and the opposite for burglary. This is especially true for the largest negative correlations. We also see that the dataset does not have many highly correlated values (i.e. with absolute value over 0.5): there are some, but not many for >100 variables.

With regard to question 1, we also need to look at correlations between offense types and geographical variables. Let’s do this by filtering the correlation dataframe.

Code
get_top_abs_correlations(corrdf, geography_and_offense_only=True)
Value1 Value2 Correlation_coef
179 NEIGHBORHOOD_ID_hale OFFENSE_CATEGORY_ID_Robbery-Business 0.274946
215 DISTRICT_ID_6 OFFENSE_CATEGORY_ID_Agg ASLT-Other 0.191872
217 DISTRICT_ID_7 OFFENSE_CATEGORY_ID_Theft from Motor Vehicle 0.190352
267 OFFENSE_CATEGORY_ID_Criminal Mischief-Graffiti NEIGHBORHOOD_ID_speer 0.153741
275 NEIGHBORHOOD_ID_gateway-green-valley-ranch OFFENSE_CATEGORY_ID_Weapons Offense 0.149714
279 OFFENSE_CATEGORY_ID_Burglary DISTRICT_ID_3 0.147926
281 OFFENSE_CATEGORY_ID_Robbery-Street-Res NEIGHBORHOOD_ID_north-capitol-hill 0.143426
283 NEIGHBORHOOD_ID_cbd OFFENSE_CATEGORY_ID_Agg ASLT-Other 0.138566
295 OFFENSE_CATEGORY_ID_Theft from Motor Vehicle NEIGHBORHOOD_ID_bear-valley 0.131556
309 OFFENSE_CATEGORY_ID_Robbery-Business NEIGHBORHOOD_ID_clayton 0.127332
311 OFFENSE_CATEGORY_ID_Drug Offenses NEIGHBORHOOD_ID_civic-center 0.126731
317 OFFENSE_CATEGORY_ID_Robbery-Street-Res NEIGHBORHOOD_ID_west-colfax 0.123525
323 OFFENSE_CATEGORY_ID_Burglary dist 0.120521
339 NEIGHBORHOOD_ID_kennedy OFFENSE_CATEGORY_ID_Larceny 0.116221
341 DISTRICT_ID_6 OFFENSE_CATEGORY_ID_Robbery-Street-Res 0.114490
353 NEIGHBORHOOD_ID_gateway-green-valley-ranch OFFENSE_CATEGORY_ID_Robbery-Street-Res 0.112044
361 OFFENSE_CATEGORY_ID_Robbery-Street-Res NEIGHBORHOOD_ID_congress-park 0.108788
365 NEIGHBORHOOD_ID_hampden-south OFFENSE_CATEGORY_ID_Robbery-Business 0.108485
367 NEIGHBORHOOD_ID_east-colfax OFFENSE_CATEGORY_ID_Auto Theft 0.108413
371 OFFENSE_CATEGORY_ID_Weapons Offense NEIGHBORHOOD_ID_barnum 0.106713
375 DISTRICT_ID_6 OFFENSE_CATEGORY_ID_All Other Crimes 0.106354
383 OFFENSE_CATEGORY_ID_Drug Offenses NEIGHBORHOOD_ID_capitol-hill 0.104310
385 OFFENSE_CATEGORY_ID_Burglary NEIGHBORHOOD_ID_overland 0.104224
391 NEIGHBORHOOD_ID_gateway-green-valley-ranch OFFENSE_CATEGORY_ID_Drug Offenses 0.102977
393 NEIGHBORHOOD_ID_city-park OFFENSE_CATEGORY_ID_Simple Assault 0.102805
397 OFFENSE_CATEGORY_ID_Robbery-Street-Res NEIGHBORHOOD_ID_union-station 0.102064
401 OFFENSE_CATEGORY_ID_Drug Offenses NEIGHBORHOOD_ID_cherry-creek 0.102033
13441 NEIGHBORHOOD_ID_union-station OFFENSE_CATEGORY_ID_Burglary -0.102042
13499 NEIGHBORHOOD_ID_cbd OFFENSE_CATEGORY_ID_Burglary -0.118263
13549 dist OFFENSE_CATEGORY_ID_All Other Crimes -0.142257
13639 DISTRICT_ID_6 OFFENSE_CATEGORY_ID_Burglary -0.213144
Some neighborhoods correlate positively with some crimes: there are positive correlations between business robberies and the Hale neighborhood, aggravated assault and the 6th district, theft from motor vehicles and the 7th district, and graffiti offenses and the Speer neighborhood.

The only outstanding negative correlation is between burglary and the 6th district. With regard to the 6th and 7th districts, we may conclude that their correlations with other variables are mutually connected: e.g. district 6 is positively correlated with aggravated assault; since this is a non-industrial crime, it is also correlated with MJ_RELATION_TYPE_NON-INDUSTRY. The absolute values of the remaining correlations are below 0.15.

As a result, we have got a dataset with a few collinearity issues.

Machine learning focused on geographical predictors

Given the number of predictors, let us first try a random forest classifier; moreover, this type of model is not sensitive to unscaled data.
We will first attempt to use the maximum number of predictors (dropping, however, OFFENSE_TYPE_ID in any event). Then we obtain a feature importance ranking and performance results, aiming to minimize the number of predictors and find optimal settings for the classifier.
Code
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from imblearn.over_sampling import SMOTE  # used for oversampling below; may duplicate an earlier cell
modeldf = df[['MJ_RELATION_TYPE', 'lat', 'long', 'DISTRICT_ID', 'OFFENSE_CATEGORY_ID',
'NEIGHBORHOOD_ID', 'Month', 'Day Name',  'VIOLENCE_RELATION', 'dist', 'duration']]
modeldf = pd.get_dummies(data=modeldf, drop_first=True, columns=['DISTRICT_ID', 'OFFENSE_CATEGORY_ID', 'NEIGHBORHOOD_ID', 'Month', 'Day Name',  'VIOLENCE_RELATION'])
modeldf.MJ_RELATION_TYPE = modeldf.MJ_RELATION_TYPE.apply(lambda x : 1 if (x == 'INDUSTRY\r') else 0)

def rf_model_assessment(predictors, target, outofbag=True, plot = True, maxdepth=None,
output1=True, output2 = False, nestimators=100, maxsamples=None, maxfeatures='sqrt', plottop=10):
    '''The function to plug different data into slightly less different RF-models'''
    
    x_train, x_test, y_train, y_test = train_test_split(predictors, target, test_size=0.25, random_state=15)
    os = SMOTE(random_state=0)
    os_data_X,os_data_y=os.fit_resample(x_train, y_train)
    x_train = pd.DataFrame(data=os_data_X,columns=x_train.columns)
    y_train = pd.DataFrame(data=os_data_y)
    clf = RandomForestClassifier(oob_score=outofbag, max_depth=maxdepth, n_estimators=nestimators, max_samples=maxsamples, max_features=maxfeatures)
    clf.fit(x_train, y_train)

    if plot ==True:
        fidf = pd.DataFrame({'Feature': [], 'Feature importance score':[]})
        fidf.Feature=x_train.columns
        fidf['Feature importance score']=clf.feature_importances_
        fidf=fidf.sort_values('Feature importance score', ascending=False)
        fidf=fidf.iloc[:plottop, :]
        ax=sns.barplot(y=fidf.Feature, x=fidf['Feature importance score'], orient='h')
        ax.set_yticks(range(0, len(fidf.Feature)))
        ax.set_yticklabels(labels=list(fidf.Feature))
        plt.show()

    yhatrain = clf.predict(x_train)
    predictiontrain = list(yhatrain)
    cmtrain = metrics.confusion_matrix(y_train, predictiontrain)
    yhattest = clf.predict(x_test)
    predictiontest = list(yhattest)
    cmtest = metrics.confusion_matrix(y_test, predictiontest)

    if output1==True:
        print ("Train Confusion Matrix : \n", cmtrain, '\n')
        print("Train Accuracy : ", metrics.accuracy_score(y_train, predictiontrain),
        '\n')
        print("Train f1-score : ", metrics.f1_score(y_train, predictiontrain), '\n')
        print("Train Recall : ", metrics.recall_score(y_train, predictiontrain), '\n')
        print("Train Precision : ", metrics.precision_score(y_train, predictiontrain), '\n')
        print ("Test Confusion Matrix : \n", cmtest)
        print("Test Accuracy : ", metrics.accuracy_score(y_test, predictiontest), '\n')
        print("Test f1-score : ", metrics.f1_score(y_test, predictiontest), '\n')
        print("Test Recall : ", metrics.recall_score(y_test, predictiontest), '\n')
        print("Test Precision : ", metrics.precision_score(y_test, predictiontest), '\n')
    
    if output2==True:
        return float(metrics.f1_score(y_test, predictiontest))

rf_model_assessment(target=modeldf.MJ_RELATION_TYPE, predictors=modeldf.iloc[:, 1:])

Train Confusion Matrix : 
 [[720   0]
 [  0 720]] 

Train Accuracy :  1.0 

Train f1-score :  1.0 

Train Recall :  1.0 

Train Precision :  1.0 

Test Confusion Matrix : 
 [[ 32  25]
 [ 11 232]]
Test Accuracy :  0.88 

Test f1-score :  0.928 

Test Recall :  0.9547325102880658 

Test Precision :  0.9027237354085603 

fig23. The model performs reasonably well, and geographical variables are again among the most important ones. Since we want to focus on geo predictors, we next drop the highest-ranked non-geo predictors; the remaining non-geo predictors we assume to be insignificant anyway.
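Impurity-based importances from RandomForestClassifier can overstate some features, so permutation importance is a common cross-check. A minimal sketch on synthetic data (not the project's dataframe):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_tr, y_tr)
# Shuffle each column on the held-out set and record the drop in score
result = permutation_importance(clf, x_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("features ranked by permutation importance:", ranking)
```

The same call on `clf` and the held-out `x_test`/`y_test` would give a second opinion on the ranking plotted in fig23.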

Code
rf_model_assessment(target=modeldf.MJ_RELATION_TYPE, predictors=modeldf.iloc[:, 3:].drop(labels=['VIOLENCE_RELATION_violent', 'OFFENSE_CATEGORY_ID_Burglary'], axis=1))

Train Confusion Matrix : 
 [[720   0]
 [  0 720]] 

Train Accuracy :  1.0 

Train f1-score :  1.0 

Train Recall :  1.0 

Train Precision :  1.0 

Test Confusion Matrix : 
 [[ 31  26]
 [ 17 226]]
Test Accuracy :  0.8566666666666667 

Test f1-score :  0.9131313131313132 

Test Recall :  0.9300411522633745 

Test Precision :  0.8968253968253969 

fig24. The performance is slightly worse; however, that is a small cost for dropping such important predictors.

Next, we attempt to drop the most significant predictors again, to decrease the computational cost of the model.

Code
rf_model_assessment(target=modeldf.MJ_RELATION_TYPE, 
predictors=modeldf.iloc[:, 1:].drop(labels=['OFFENSE_CATEGORY_ID_Robbery-Street-Res',
'VIOLENCE_RELATION_violent', 'OFFENSE_CATEGORY_ID_Burglary'], axis=1))

Train Confusion Matrix : 
 [[720   0]
 [  0 720]] 

Train Accuracy :  1.0 

Train f1-score :  1.0 

Train Recall :  1.0 

Train Precision :  1.0 

Test Confusion Matrix : 
 [[ 25  32]
 [ 18 225]]
Test Accuracy :  0.8333333333333334 

Test f1-score :  0.8999999999999999 

Test Recall :  0.9259259259259259 

Test Precision :  0.8754863813229572 

fig25. The performance is slightly worse again, but still acceptable. This time the top-3 predictors are geographical, and the model remains effective.

At this point we consider the predictor subsetting to have reached a tolerable mark, and it is time to tune the parameters of the model.

The metric we will consider for this purpose is the test f1-score, because it combines precision and recall. We do not take the train f1-score into account, because it was close to perfect throughout the modelling. We visualize each parameter against the f1-score first and then use the best-fitting parameters to obtain the final model.

Code
leest1 = []
leest2 = []
for i in range(1, 101, 1):
    leest1.append(rf_model_assessment(predictors=modeldf.iloc[:, 1:].drop(labels=['OFFENSE_CATEGORY_ID_Robbery-Street-Res', 'VIOLENCE_RELATION_violent','OFFENSE_CATEGORY_ID_Burglary'], axis=1),
    target=modeldf.MJ_RELATION_TYPE, plot = False, nestimators=i,
    output1=False, output2 = True))
    leest2.append(i)
ax=sns.lineplot(x=leest2, y=leest1)
ax.set_xticks(ticks = list(range(0, 110, 10)))
ax.set_xlabel(xlabel='Number of trees')
ax.set_ylabel(ylabel='F1-score')
nestimators = leest2[leest1.index(max(leest1))]
plt.show()

leest1 = []
leest2 = []
for i in range(1, 50, 1):
    leest1.append(rf_model_assessment(predictors=modeldf.iloc[:, 1:].drop(labels=['OFFENSE_CATEGORY_ID_Robbery-Street-Res', 'VIOLENCE_RELATION_violent', 'OFFENSE_CATEGORY_ID_Burglary'], axis=1),
    target=modeldf.MJ_RELATION_TYPE, plot = False, maxdepth=i,
    output1=False, output2 = True))
    leest2.append(i)
ax=sns.lineplot(x=leest2, y=leest1)
ax.set_xticks(ticks = list(range(0, 55, 5)))
ax.set_xlabel(xlabel='Maximum tree depth')
ax.set_ylabel(ylabel='F1-score')
maxdepth = leest2[leest1.index(max(leest1))]
plt.show()


leest1 = []
leest2 = []
for i in range(1, 50, 5):
    leest1.append(rf_model_assessment(predictors=modeldf.iloc[:, 1:].drop(labels=['OFFENSE_CATEGORY_ID_Robbery-Street-Res', 'VIOLENCE_RELATION_violent', 'OFFENSE_CATEGORY_ID_Burglary'], axis=1),
    target=modeldf.MJ_RELATION_TYPE, plot = False,
    output1=False, output2 = True, maxfeatures=i))
    leest2.append(i)
ax=sns.lineplot(x=leest2, y=leest1)
ax.set_xticks(ticks = list(range(0, 55, 5)))
ax.set_xlabel(xlabel='Maximum features in tree')
ax.set_ylabel(ylabel='F1-score')
maxfeatures = leest2[leest1.index(max(leest1))]
plt.show()

fig26-28. A random forest classifier has many tunable parameters, but we choose the number of trees, the tree depth and the number of features included in each tree. A common feature of all three plots is the absence of a uniform trend, which is typical for random forest algorithms.
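The one-parameter-at-a-time sweeps above could also be run jointly with a grid search. A minimal GridSearchCV sketch on synthetic data (with a deliberately small grid to keep it fast; the real search would use the project's predictors and a finer grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, None],
    'max_features': ['sqrt', 0.5],
}
# scoring='f1' matches the metric chosen above; cv=3 keeps the search cheap
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring='f1', cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Unlike the one-dimensional sweeps, this explores interactions between the parameters, at a higher computational cost.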
Code
rf_model_assessment(target=modeldf.MJ_RELATION_TYPE, 
predictors=modeldf.iloc[:, 1:].drop(labels=['OFFENSE_CATEGORY_ID_Robbery-Street-Res',
'VIOLENCE_RELATION_violent', 'OFFENSE_CATEGORY_ID_Burglary'], axis=1),
nestimators=nestimators, maxdepth=maxdepth, maxfeatures=maxfeatures)

Train Confusion Matrix : 
 [[719   1]
 [  2 718]] 

Train Accuracy :  0.9979166666666667 

Train f1-score :  0.9979152189020153 

Train Recall :  0.9972222222222222 

Train Precision :  0.9986091794158554 

Test Confusion Matrix : 
 [[ 30  27]
 [ 26 217]]
Test Accuracy :  0.8233333333333334 

Test f1-score :  0.891170431211499 

Test Recall :  0.8930041152263375 

Test Precision :  0.889344262295082 

fig29. The model has not changed its focus on geo predictors, and the rest of the predictors have not been promoted greatly. The test f1-score is about 0.89, which is fairly high.

Due to the limited interpretability of random forest models and our desire for an interpretable result, we now turn to logistic regression. The procedure is as follows: first, we draw a ROC curve and compute the best classification threshold based on the mean of the train and test f1-scores (in the random forest, train performance was almost perfect all the time, so there was no point in computing the mean). Then we use that threshold to improve the model. All of this is done alongside predictor subsetting. Note that the model is assessed via confusion matrices for both training and testing; the number of observations in the training confusion matrix should not be surprising, since oversampling was performed.
Code
import statsmodels.api as sm        # these two imports may duplicate earlier cells
from sklearn import preprocessing

def lr_roc(predictors, target):
    '''The function draws the ROC curve depending on the threshold we choose'''
    predictors=sm.add_constant(predictors)
    x_train, x_test, y_train, y_test = train_test_split(predictors, target, test_size=0.25, random_state=1)
    columns = predictors.columns

    os = SMOTE(random_state=0)
    os_data_X,os_data_y=os.fit_resample(x_train, y_train)
    x_train = pd.DataFrame(data=os_data_X,columns=columns)
    x_train = preprocessing.normalize(x_train)
    x_train = pd.DataFrame(data=x_train,columns=columns)
    y_train = pd.DataFrame(data=os_data_y)
    normalizer = preprocessing.Normalizer().fit(x_train)
    x_test = normalizer.transform(x_test)

    clf=sm.Logit(y_train, x_train).fit(method='bfgs')
    yhattrain = clf.predict(x_train)
    yhattest = clf.predict(x_test)
    tpr=[]
    fpr=[]
    threscrit=[]
    thres=[]
        
    for i in range(0,100,1):
        predictiontrain = list(map(lambda x: 0 if x < (i/100) else 1, yhattrain))
        predictiontest = list(map(lambda x: 0 if x < (i/100) else 1, yhattest))
        cmtest = metrics.confusion_matrix(y_test, predictiontest)
        precisiontrain = metrics.precision_score(y_train, predictiontrain)
        precisiontest = metrics.precision_score(y_test, predictiontest)
        recalltrain = metrics.recall_score(y_train, predictiontrain)
        recalltest = metrics.recall_score(y_test, predictiontest)
        tntest, fptest = cmtest.ravel()[[0,1]]
        specificity = tntest / (tntest+fptest)
        tpr.append(recalltest)
        fpr.append(1-specificity)
        thres.append(i/100)

        if precisiontrain>0.7:
            threscrit.append((metrics.f1_score(y_test, predictiontest)+metrics.f1_score(y_train, predictiontrain))/2)
        else:
            threscrit.append(0)
               
    ax1=sns.lineplot(x=fpr, y=tpr)
    ax1.set_xticks(ticks = [z * 0.01 for z in range(0, 110, 10)])
    ax1.set_yticks(ticks = [z * 0.01 for z in range(0, 110, 10)])
    ax1.set_xlabel(xlabel='1 - specificity')
    ax1.set_ylabel(ylabel='Recall')
    ax2=sns.lineplot(x=[i*0.01 for i in range (0, 110, 10)], y=[i*0.01 for i in range (0, 110, 10)])
    ax2.lines[1].set_linestyle("--")
    ax1.set(ylim=(0, 1.05), xlim=(0, 1.05))
    ax2.set(ylim=(0, 1.05), xlim=(0, 1.05))
    
    thresopt = thres[threscrit.index(max(threscrit))]
    print('The optimal threshold based on mean test and train f1-score is', thresopt)
    plt.show()

def lr_model_assessment(predictors, target, output1=True, output2 = False, pvaluedf=True, threshold = 0.5):
    '''The function to plug different data into slightly less different logistic regression-models'''
   
    predictors=sm.add_constant(predictors)
    x_train, x_test, y_train, y_test = train_test_split(predictors, target, test_size=0.25, random_state=1)
    columns = predictors.columns

    os = SMOTE(random_state=0)
    os_data_X,os_data_y=os.fit_resample(x_train, y_train)
    x_train = pd.DataFrame(data=os_data_X,columns=columns)
    x_train = preprocessing.normalize(x_train)
    x_train = pd.DataFrame(data=x_train,columns=columns)
    y_train = pd.DataFrame(data=os_data_y)
    normalizer = preprocessing.Normalizer().fit(x_train)
    x_test = normalizer.transform(x_test)

    clf=sm.Logit(y_train, x_train).fit(method='bfgs')
    print(clf.summary())

    yhattrain = clf.predict(x_train)
    yhattrain = list(map(lambda x: 0 if x < threshold else 1, yhattrain))
    cmtrain = metrics.confusion_matrix(y_train, yhattrain)
    yhattest = clf.predict(x_test)
    yhattest = list(map(lambda x: 0 if x < threshold else 1, yhattest))
    
    cmtest = metrics.confusion_matrix(y_test, yhattest)

    if output1==True:
        print ("Train Confusion Matrix : \n", cmtrain, '\n')
        print("Train Accuracy : ", metrics.accuracy_score(y_train, yhattrain),'\n')
        print("Train f1-score : ", metrics.f1_score(y_train, yhattrain), '\n')
        print("Train Recall : ", metrics.recall_score(y_train, yhattrain), '\n')
        print("Train Precision : ", metrics.precision_score(y_train, yhattrain), '\n')
        print ("Test Confusion Matrix : \n", cmtest)
        print("Test Accuracy : ", metrics.accuracy_score(y_test, yhattest), '\n')
        print("Test f1-score : ", metrics.f1_score(y_test, yhattest), '\n')
        print("Test Recall : ", metrics.recall_score(y_test, yhattest), '\n')
        print("Test Precision : ", metrics.precision_score(y_test, yhattest), '\n')
    
    if output2==True:
        return float(metrics.f1_score(y_test, yhattest))

    if pvaluedf==True:
        pvaluedf=pd.DataFrame({'coefficient':clf.params, 'p-value':clf.pvalues}).reset_index()
        pvaluedf.drop(pvaluedf[pvaluedf['p-value']>0.05].index, inplace=True)
        pvaluedf=pvaluedf.sort_values(by=['p-value'], ascending=True)
        return pvaluedf

lrdf = modeldf

lr_roc(predictors=lrdf.iloc[:, 1:], target=lrdf.MJ_RELATION_TYPE)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.550117
         Iterations: 35
         Function evaluations: 40
         Gradient evaluations: 40
The optimal threshold based on mean test and train f1-score is 0.33

fig30. The ROC curve for the model containing all the predictors is tolerable, but it is clear that the model will not be as accurate as the random forest one.
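sklearn's roc_curve returns the candidate thresholds directly, so the threshold scan described above can also be phrased as follows. A minimal sketch on synthetic scores (standing in for the model's predicted probabilities), picking the threshold that maximizes f1:

```python
import numpy as np
from sklearn.metrics import roc_curve, f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)
# Noisy scores correlated with the labels, standing in for clf.predict() output
y_score = np.clip(0.35 * y_true + 0.65 * rng.random(300), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
cand = thresholds[thresholds <= 1]          # drop roc_curve's sentinel threshold
f1s = [f1_score(y_true, (y_score >= t).astype(int)) for t in cand]
best = cand[int(np.argmax(f1s))]
print(f"best threshold by f1: {best:.2f}")
```

The project's criterion additionally averages train and test f1 and requires a minimum train precision, but the mechanics are the same.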

Code
lr_model_assessment(predictors=lrdf.iloc[:, 1:], target=lrdf.MJ_RELATION_TYPE, threshold=0.33)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.550117
         Iterations: 35
         Function evaluations: 40
         Gradient evaluations: 40
                           Logit Regression Results                           
==============================================================================
Dep. Variable:       MJ_RELATION_TYPE   No. Observations:                 1456
Model:                          Logit   Df Residuals:                     1339
Method:                           MLE   Df Model:                          116
Date:                Sun, 29 Jan 2023   Pseudo R-squ.:                  0.2063
Time:                        04:00:27   Log-Likelihood:                -800.97
converged:                      False   LL-Null:                       -1009.2
Covariance Type:            nonrobust   LLR p-value:                 1.767e-35
==================================================================================================================
                                                     coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------
const                                             -0.0254   6.02e+04  -4.21e-07      1.000   -1.18e+05    1.18e+05
lat                                               -3.3746    658.436     -0.005      0.996   -1293.885    1287.135
long                                              -0.4392    448.328     -0.001      0.999    -879.145     878.267
dist                                              -9.4911      6.501     -1.460      0.144     -22.233       3.251
duration                                          60.3683     18.264      3.305      0.001      24.572      96.164
DISTRICT_ID_2                                     36.9233     44.686      0.826      0.409     -50.660     124.507
DISTRICT_ID_3                                     45.6058     59.807      0.763      0.446     -71.614     162.825
DISTRICT_ID_4                                     33.7402     48.723      0.692      0.489     -61.754     129.235
DISTRICT_ID_5                                     -9.2970    132.260     -0.070      0.944    -268.521     249.927
DISTRICT_ID_6                                    -47.3487     43.310     -1.093      0.274    -132.235      37.537
DISTRICT_ID_7                                     -0.4463    494.934     -0.001      0.999    -970.499     969.606
OFFENSE_CATEGORY_ID_Agg ASLT-Other                -2.1349     66.794     -0.032      0.975    -133.049     128.779
OFFENSE_CATEGORY_ID_All Other Crimes              17.6329     37.818      0.466      0.641     -56.490      91.756
OFFENSE_CATEGORY_ID_Auto Theft                     0.0543     84.597      0.001      0.999    -165.752     165.861
OFFENSE_CATEGORY_ID_Burglary                     160.3571     24.256      6.611      0.000     112.817     207.897
OFFENSE_CATEGORY_ID_Criminal Mischief-Graffiti     1.3762    145.926      0.009      0.992    -284.633     287.386
OFFENSE_CATEGORY_ID_Criminal Mischief-Property    19.2841     43.808      0.440      0.660     -66.578     105.146
OFFENSE_CATEGORY_ID_Criminal Mischief-Vehicle      0.4862    141.132      0.003      0.997    -276.127     277.099
OFFENSE_CATEGORY_ID_Drug Offenses                  3.9784     51.708      0.077      0.939     -97.368     105.324
OFFENSE_CATEGORY_ID_Larceny                       18.8973     32.785      0.576      0.564     -45.361      83.155
OFFENSE_CATEGORY_ID_Robbery-Business               5.0163     74.934      0.067      0.947    -141.852     151.885
OFFENSE_CATEGORY_ID_Robbery-Street-Res           -79.3814     29.321     -2.707      0.007    -136.850     -21.913
OFFENSE_CATEGORY_ID_Simple Assault                -2.6322     54.632     -0.048      0.962    -109.709     104.444
OFFENSE_CATEGORY_ID_Theft from Motor Vehicle       0.1860     58.864      0.003      0.997    -115.186     115.558
OFFENSE_CATEGORY_ID_Weapons Offense                0.9207    139.348      0.007      0.995    -272.197     274.038
NEIGHBORHOOD_ID_auraria                           -3.5400     90.716     -0.039      0.969    -181.340     174.260
NEIGHBORHOOD_ID_baker                              4.2428     64.850      0.065      0.948    -122.861     131.347
NEIGHBORHOOD_ID_barnum                             1.4197     90.031      0.016      0.987    -175.039     177.878
NEIGHBORHOOD_ID_barnum-west                        0.4095    127.945      0.003      0.997    -250.358     251.177
NEIGHBORHOOD_ID_bear-valley                       -0.4451   6.89e+06  -6.46e-08      1.000   -1.35e+07    1.35e+07
NEIGHBORHOOD_ID_belcaro                            4.4318    105.191      0.042      0.966    -201.740     210.603
NEIGHBORHOOD_ID_berkeley                           0.4702    127.808      0.004      0.997    -250.030     250.970
NEIGHBORHOOD_ID_capitol-hill                       1.5164     86.824      0.017      0.986    -168.656     171.689
NEIGHBORHOOD_ID_cbd                               -4.7003     65.036     -0.072      0.942    -132.169     122.768
NEIGHBORHOOD_ID_cheesman-park                     -5.1579     62.669     -0.082      0.934    -127.987     117.672
NEIGHBORHOOD_ID_cherry-creek                       0.0194    179.093      0.000      1.000    -350.997     351.036
NEIGHBORHOOD_ID_city-park                          0.4535    147.951      0.003      0.998    -289.525     290.432
NEIGHBORHOOD_ID_city-park-west                    -4.9359     71.028     -0.069      0.945    -144.148     134.276
NEIGHBORHOOD_ID_civic-center                       0.4618    161.940      0.003      0.998    -316.934     317.857
NEIGHBORHOOD_ID_clayton                           -1.3497    113.736     -0.012      0.991    -224.269     221.569
NEIGHBORHOOD_ID_cole                               3.5627     65.530      0.054      0.957    -124.874     132.000
NEIGHBORHOOD_ID_college-view-south-platte          7.0529     58.104      0.121      0.903    -106.829     120.935
NEIGHBORHOOD_ID_congress-park                     -0.4462    242.815     -0.002      0.999    -476.355     475.463
NEIGHBORHOOD_ID_cory-merrill                       0.4411    276.328      0.002      0.999    -541.152     542.034
NEIGHBORHOOD_ID_dia                               -2.6886    169.717     -0.016      0.987    -335.327     329.950
NEIGHBORHOOD_ID_east-colfax                        0.0110     65.785      0.000      1.000    -128.926     128.948
NEIGHBORHOOD_ID_elyria-swansea                    29.1680     51.625      0.565      0.572     -72.015     130.351
NEIGHBORHOOD_ID_five-points                        1.6135     45.012      0.036      0.971     -86.609      89.836
NEIGHBORHOOD_ID_fort-logan                         0.4372    260.749      0.002      0.999    -510.622     511.497
NEIGHBORHOOD_ID_gateway-green-valley-ranch        -4.4641    159.537     -0.028      0.978    -317.151     308.223
NEIGHBORHOOD_ID_globeville                        11.6952     57.861      0.202      0.840    -101.710     125.100
NEIGHBORHOOD_ID_goldsmith                          1.3325    125.942      0.011      0.992    -245.509     248.174
NEIGHBORHOOD_ID_hale                               2.2744    122.674      0.019      0.985    -238.162     242.711
NEIGHBORHOOD_ID_hampden                            3.9520     88.913      0.044      0.965    -170.314     178.218
NEIGHBORHOOD_ID_hampden-south                      0.4373    249.650      0.002      0.999    -488.868     489.742
NEIGHBORHOOD_ID_harvey-park                       -1.3527    147.349     -0.009      0.993    -290.151     287.445
NEIGHBORHOOD_ID_harvey-park-south                 -0.4742    304.403     -0.002      0.999    -597.093     596.144
NEIGHBORHOOD_ID_highland                           1.8737     78.438      0.024      0.981    -151.861     155.609
NEIGHBORHOOD_ID_indian-creek                       0.4305    154.248      0.003      0.998    -301.889     302.750
NEIGHBORHOOD_ID_jefferson-park                     1.3609    150.987      0.009      0.993    -294.568     297.290
NEIGHBORHOOD_ID_kennedy                           -1.3647    162.841     -0.008      0.993    -320.527     317.798
NEIGHBORHOOD_ID_lincoln-park                       6.0029     48.463      0.124      0.901     -88.983     100.989
NEIGHBORHOOD_ID_lowry-field                       -0.4483    271.349     -0.002      0.999    -532.283     531.387
NEIGHBORHOOD_ID_mar-lee                            2.7113     69.377      0.039      0.969    -133.264     138.687
NEIGHBORHOOD_ID_marston                            1.7390    151.973      0.011      0.991    -296.122     299.600
NEIGHBORHOOD_ID_montbello                          3.2168    114.071      0.028      0.978    -220.357     226.791
NEIGHBORHOOD_ID_montclair                          2.2337    105.861      0.021      0.983    -205.250     209.717
NEIGHBORHOOD_ID_north-capitol-hill                -7.4346     72.017     -0.103      0.918    -148.586     133.717
NEIGHBORHOOD_ID_north-park-hill                   -0.9110    172.248     -0.005      0.996    -338.511     336.689
NEIGHBORHOOD_ID_northeast-park-hill               10.6224     58.115      0.183      0.855    -103.281     124.526
NEIGHBORHOOD_ID_overland                          28.6090     60.641      0.472      0.637     -90.244     147.462
NEIGHBORHOOD_ID_platt-park                         3.9874     82.451      0.048      0.961    -157.613     165.588
NEIGHBORHOOD_ID_regis                              0.4397    156.538      0.003      0.998    -306.369     307.248
NEIGHBORHOOD_ID_rosedale                           5.2854     88.643      0.060      0.952    -168.451     179.022
NEIGHBORHOOD_ID_ruby-hill                          9.7928     68.108      0.144      0.886    -123.697     143.282
NEIGHBORHOOD_ID_skyland                           -1.3667    150.464     -0.009      0.993    -296.270     293.537
NEIGHBORHOOD_ID_sloan-lake                        -7.2550     68.593     -0.106      0.916    -141.694     127.184
NEIGHBORHOOD_ID_south-park-hill                    2.6772    122.088      0.022      0.983    -236.610     241.964
NEIGHBORHOOD_ID_southmoor-park                    -0.4659    239.215     -0.002      0.998    -469.318     468.386
NEIGHBORHOOD_ID_speer                              2.7153    108.128      0.025      0.980    -209.212     214.642
NEIGHBORHOOD_ID_stapleton                         -0.4796    129.072     -0.004      0.997    -253.456     252.497
NEIGHBORHOOD_ID_sun-valley                         7.6561     69.761      0.110      0.913    -129.072     144.385
NEIGHBORHOOD_ID_sunnyside                          4.0955     71.091      0.058      0.954    -135.240     143.430
NEIGHBORHOOD_ID_union-station                     -2.0381     78.056     -0.026      0.979    -155.024     150.948
NEIGHBORHOOD_ID_university                        -0.4715    159.220     -0.003      0.998    -312.537     311.594
NEIGHBORHOOD_ID_university-hills                   3.1071    106.458      0.029      0.977    -205.546     211.761
NEIGHBORHOOD_ID_university-park                   -1.3713    151.567     -0.009      0.993    -298.437     295.695
NEIGHBORHOOD_ID_valverde                          12.9486     63.443      0.204      0.838    -111.398     137.295
NEIGHBORHOOD_ID_villa-park                         1.3602     86.059      0.016      0.987    -167.313     170.033
NEIGHBORHOOD_ID_virginia-village                   3.0695     91.111      0.034      0.973    -175.505     181.644
NEIGHBORHOOD_ID_washington-park                   -0.0215    190.500     -0.000      1.000    -373.394     373.351
NEIGHBORHOOD_ID_washington-park-west               3.5977    103.113      0.035      0.972    -198.501     205.696
NEIGHBORHOOD_ID_washington-virginia-vale          -2.7776     64.871     -0.043      0.966    -129.922     124.366
NEIGHBORHOOD_ID_wellshire                          1.7977    131.173      0.014      0.989    -255.297     258.892
NEIGHBORHOOD_ID_west-colfax                       -3.0403     68.661     -0.044      0.965    -137.614     131.533
NEIGHBORHOOD_ID_west-highland                      2.7856    109.044      0.026      0.980    -210.937     216.509
NEIGHBORHOOD_ID_westwood                           3.9961     75.233      0.053      0.958    -143.459     151.451
NEIGHBORHOOD_ID_whittier                          -0.4460    241.178     -0.002      0.999    -473.147     472.255
NEIGHBORHOOD_ID_windsor                            2.2100    125.791      0.018      0.986    -244.336     248.756
Month_2                                           25.5084     37.335      0.683      0.494     -47.666      98.683
Month_3                                            7.6853     38.829      0.198      0.843     -68.418      83.788
Month_4                                           19.8434     39.702      0.500      0.617     -57.971      97.658
Month_5                                           15.2611     33.226      0.459      0.646     -49.861      80.384
Month_6                                           -0.2902     29.927     -0.010      0.992     -58.947      58.366
Month_7                                            9.5228     30.538      0.312      0.755     -50.331      69.376
Month_8                                           29.2523     31.149      0.939      0.348     -31.799      90.304
Month_9                                           25.4496     31.397      0.811      0.418     -36.087      86.986
Month_10                                           8.5986     35.177      0.244      0.807     -60.347      77.544
Month_11                                          11.7643     34.076      0.345      0.730     -55.023      78.552
Month_12                                          14.3059     40.003      0.358      0.721     -64.098      92.710
Day Name_Monday                                   21.4675     25.265      0.850      0.395     -28.050      70.985
Day Name_Saturday                                 15.5173     27.473      0.565      0.572     -38.329      69.363
Day Name_Sunday                                   18.1373     27.400      0.662      0.508     -35.566      71.840
Day Name_Thursday                                 25.5342     25.142      1.016      0.310     -23.744      74.812
Day Name_Tuesday                                  19.1222     26.955      0.709      0.478     -33.709      71.954
Day Name_Wednesday                                14.1518     24.931      0.568      0.570     -34.711      63.015
VIOLENCE_RELATION_violent                         60.4797     22.284      2.714      0.007      16.804     104.155
==================================================================================================================
Train Confusion Matrix : 
 [[458 270]
 [ 74 654]] 

Train Accuracy :  0.7637362637362637 

Train f1-score :  0.7917675544794188 

Train Recall :  0.8983516483516484 

Train Precision :  0.7077922077922078 

Test Confusion Matrix : 
 [[ 19  46]
 [ 23 212]]
Test Accuracy :  0.77 

Test f1-score :  0.8600405679513183 

Test Recall :  0.902127659574468 

Test Precision :  0.8217054263565892 
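As a sanity check, the accuracy, precision, recall and f1 values reported above can be recomputed directly from the confusion matrices. A minimal sketch, assuming the printed matrix layout is `[[TN, FP], [FN, TP]]` with the positive class in the second row:

```python
def metrics_from_cm(cm):
    """Accuracy, precision, recall and f1 from a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# train confusion matrix reported above
acc, prec, rec, f1 = metrics_from_cm([[458, 270], [74, 654]])
print(acc, prec, rec, f1)  # ≈ 0.764, 0.708, 0.898, 0.792
```

The results reproduce the reported train metrics, confirming the layout assumption.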
index coefficient p-value
14 OFFENSE_CATEGORY_ID_Burglary 160.357112 3.814328e-11
4 duration 60.368274 9.484252e-04
116 VIOLENCE_RELATION_violent 60.479712 6.646060e-03
21 OFFENSE_CATEGORY_ID_Robbery-Street-Res -79.381362 6.783270e-03
The performance of the model is not bad. However, the small number of coefficients with low p-values indicates that the logistic regression struggles to extract signal. In particular, if we keep only the predictors with low p-values, the model does not function properly.
Code
lr_roc(predictors=lrdf[['VIOLENCE_RELATION_violent', 'OFFENSE_CATEGORY_ID_Burglary', 'duration', 'OFFENSE_CATEGORY_ID_Robbery-Street-Res']], target=lrdf.MJ_RELATION_TYPE)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.542805
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
The optimal threshold based on mean test and train f1-score is 0.67

fig31. The ROC-curve does not look promising.
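The threshold reported above is chosen by scanning a grid of candidate cut-offs and keeping the one that maximises the mean of the train and test f1-scores. A minimal sketch of that selection, where `y_train`/`y_test` and `p_train`/`p_test` are hypothetical stand-ins for the true labels and the model's predicted probabilities:

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """Binary f1 without external dependencies."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_train, p_train, y_test, p_test):
    """Return the cut-off maximising the mean of train and test f1-scores."""
    grid = np.arange(0.01, 1.0, 0.01)
    scores = [(f1_binary(y_train, (p_train >= t).astype(int))
               + f1_binary(y_test, (p_test >= t).astype(int))) / 2
              for t in grid]
    return float(grid[int(np.argmax(scores))])
```

Note that folding the test f1 into the selection leaks test information into the threshold; a separate validation split would be a cleaner design choice.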

Code
lr_model_assessment(predictors=lrdf[['VIOLENCE_RELATION_violent', 'OFFENSE_CATEGORY_ID_Burglary', 'duration', 'OFFENSE_CATEGORY_ID_Robbery-Street-Res']], target=lrdf.MJ_RELATION_TYPE, pvaluedf=False, threshold=0.67)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.542805
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
                           Logit Regression Results                           
==============================================================================
Dep. Variable:       MJ_RELATION_TYPE   No. Observations:                 1456
Model:                          Logit   Df Residuals:                     1451
Method:                           MLE   Df Model:                            4
Date:                Sun, 29 Jan 2023   Pseudo R-squ.:                  0.2169
Time:                        04:00:31   Log-Likelihood:                -790.32
converged:                      False   LL-Null:                       -1009.2
Covariance Type:            nonrobust   LLR p-value:                 1.889e-93
==========================================================================================================
                                             coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------
const                                     -0.4761      0.102     -4.647      0.000      -0.677      -0.275
VIOLENCE_RELATION_violent                  0.9844      0.211      4.663      0.000       0.571       1.398
OFFENSE_CATEGORY_ID_Burglary               1.4768      0.220      6.713      0.000       1.046       1.908
duration                                  -0.0620      0.243     -0.255      0.799      -0.539       0.415
OFFENSE_CATEGORY_ID_Robbery-Street-Res    -5.9838      0.595    -10.063      0.000      -7.149      -4.818
==========================================================================================================
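The logit coefficients above are on the log-odds scale; exponentiating them yields odds ratios, which are easier to read. A minimal sketch using the point estimates from the table (interpretation only, not a refit):

```python
import math

# point estimates copied from the Logit summary above
coefs = {
    'VIOLENCE_RELATION_violent': 0.9844,
    'OFFENSE_CATEGORY_ID_Burglary': 1.4768,
    'duration': -0.0620,
    'OFFENSE_CATEGORY_ID_Robbery-Street-Res': -5.9838,
}
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}
# e.g. a burglary multiplies the odds of the positive class by roughly 4.4,
# while a street/residential robbery shrinks them by a factor of ~400
for name, ratio in odds_ratios.items():
    print(f'{name}: {ratio:.4f}')
```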
Train Confusion Matrix : 
 [[661  67]
 [277 451]] 

Train Accuracy :  0.7637362637362637 

Train f1-score :  0.7239165329052969 

Train Recall :  0.6195054945054945 

Train Precision :  0.8706563706563707 

Test Confusion Matrix : 
 [[ 50  15]
 [106 129]]
Test Accuracy :  0.5966666666666667 

Test f1-score :  0.6807387862796833 

Test Recall :  0.548936170212766 

Test Precision :  0.8958333333333334 

And the performance is indeed poor. However, if we try using geographical predictors only, the result is not significantly better.

Code
# one-hot location columns selected by position
districts = lrdf.iloc[:, 25:99]
hoods = lrdf.iloc[:, 5:11]
# combine the location dummies with the numeric coordinate and distance features
geopredictors = pd.concat(objs=[districts, hoods, lrdf[['lat', 'long', 'dist']]], axis=1)
lr_roc(predictors=geopredictors, target=lrdf.MJ_RELATION_TYPE)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.605019
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
The optimal threshold based on mean test and train f1-score is 0.59

fig32. The ROC-curve looks better…

Code
lr_model_assessment(predictors=geopredictors, target=lrdf.MJ_RELATION_TYPE, threshold=0.59)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.605019
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
                           Logit Regression Results                           
==============================================================================
Dep. Variable:       MJ_RELATION_TYPE   No. Observations:                 1456
Model:                          Logit   Df Residuals:                     1372
Method:                           MLE   Df Model:                           83
Date:                Sun, 29 Jan 2023   Pseudo R-squ.:                  0.1271
Time:                        04:00:36   Log-Likelihood:                -880.91
converged:                      False   LL-Null:                       -1009.2
Covariance Type:            nonrobust   LLR p-value:                 1.272e-19
==============================================================================================================
                                                 coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------
const                                         -0.0539   5.65e+04  -9.53e-07      1.000   -1.11e+05    1.11e+05
NEIGHBORHOOD_ID_auraria                      -19.5770     67.683     -0.289      0.772    -152.233     113.079
NEIGHBORHOOD_ID_baker                          9.1948     56.751      0.162      0.871    -102.034     120.424
NEIGHBORHOOD_ID_barnum                         4.1652     82.204      0.051      0.960    -156.951     165.281
NEIGHBORHOOD_ID_barnum-west                   -0.0192     97.730     -0.000      1.000    -191.566     191.528
NEIGHBORHOOD_ID_bear-valley                   -1.3956    230.640     -0.006      0.995    -453.442     450.650
NEIGHBORHOOD_ID_belcaro                       13.9151     88.450      0.157      0.875    -159.445     187.275
NEIGHBORHOOD_ID_berkeley                      -1.3627     97.935     -0.014      0.989    -193.311     190.586
NEIGHBORHOOD_ID_capitol-hill                   4.3823    100.090      0.044      0.965    -191.790     200.554
NEIGHBORHOOD_ID_cbd                          -28.9125     74.950     -0.386      0.700    -175.811     117.986
NEIGHBORHOOD_ID_cheesman-park                -35.7797     72.821     -0.491      0.623    -178.505     106.946
NEIGHBORHOOD_ID_cherry-creek                  -0.0198    178.866     -0.000      1.000    -350.590     350.550
NEIGHBORHOOD_ID_city-park                      1.3870    136.728      0.010      0.992    -266.595     269.369
NEIGHBORHOOD_ID_city-park-west               -16.5233     88.130     -0.187      0.851    -189.254     156.208
NEIGHBORHOOD_ID_civic-center                   1.4501    174.082      0.008      0.993    -339.744     342.644
NEIGHBORHOOD_ID_clayton                       -2.8196    117.849     -0.024      0.981    -233.799     228.160
NEIGHBORHOOD_ID_cole                          12.4755     57.319      0.218      0.828     -99.867     124.818
NEIGHBORHOOD_ID_college-view-south-platte     20.8527     47.023      0.443      0.657     -71.311     113.016
NEIGHBORHOOD_ID_congress-park                 -1.4077    231.956     -0.006      0.995    -456.034     453.218
NEIGHBORHOOD_ID_cory-merrill                   1.3926    241.818      0.006      0.995    -472.562     475.348
NEIGHBORHOOD_ID_dia                           -8.1386    130.470     -0.062      0.950    -263.856     247.579
NEIGHBORHOOD_ID_east-colfax                   -1.4575     53.810     -0.027      0.978    -106.924     104.009
NEIGHBORHOOD_ID_elyria-swansea                91.3925     43.688      2.092      0.036       5.767     177.019
NEIGHBORHOOD_ID_five-points                  -14.9275     41.715     -0.358      0.720     -96.687      66.832
NEIGHBORHOOD_ID_fort-logan                     1.3962    231.229      0.006      0.995    -451.804     454.596
NEIGHBORHOOD_ID_gateway-green-valley-ranch   -13.6746    116.973     -0.117      0.907    -242.938     215.589
NEIGHBORHOOD_ID_globeville                    36.4531     50.075      0.728      0.467     -61.691     134.598
NEIGHBORHOOD_ID_goldsmith                      4.1620    112.668      0.037      0.971    -216.663     224.987
NEIGHBORHOOD_ID_hale                           6.9863    107.277      0.065      0.948    -203.273     217.245
NEIGHBORHOOD_ID_hampden                       12.4775     80.263      0.155      0.876    -144.836     169.791
NEIGHBORHOOD_ID_hampden-south                  1.3905    234.984      0.006      0.995    -459.171     461.952
NEIGHBORHOOD_ID_harvey-park                   -4.1976    134.212     -0.031      0.975    -267.249     258.854
NEIGHBORHOOD_ID_harvey-park-south             -1.3968    228.991     -0.006      0.995    -450.211     447.417
NEIGHBORHOOD_ID_highland                       7.0392     73.537      0.096      0.924    -137.091     151.169
NEIGHBORHOOD_ID_indian-creek                   1.3801    140.993      0.010      0.992    -274.961     277.721
NEIGHBORHOOD_ID_jefferson-park                 4.2141    132.870      0.032      0.975    -256.206     264.634
NEIGHBORHOOD_ID_kennedy                       -4.1901    141.114     -0.030      0.976    -280.769     272.388
NEIGHBORHOOD_ID_lincoln-park                  18.2486     45.222      0.404      0.687     -70.385     106.882
NEIGHBORHOOD_ID_lowry-field                   -1.4053    237.796     -0.006      0.995    -467.477     464.667
NEIGHBORHOOD_ID_mar-lee                        9.7338     60.382      0.161      0.872    -108.613     128.080
NEIGHBORHOOD_ID_marston                        5.5713    136.857      0.041      0.968    -262.663     273.806
NEIGHBORHOOD_ID_montbello                     13.2483     77.591      0.171      0.864    -138.827     165.323
NEIGHBORHOOD_ID_montclair                      6.9780     92.085      0.076      0.940    -173.506     187.462
NEIGHBORHOOD_ID_north-capitol-hill           -24.9070     85.388     -0.292      0.771    -192.265     142.451
NEIGHBORHOOD_ID_north-park-hill               -2.8092    163.197     -0.017      0.986    -322.670     317.051
NEIGHBORHOOD_ID_northeast-park-hill           26.3373     46.198      0.570      0.569     -64.209     116.884
NEIGHBORHOOD_ID_overland                      90.0139     55.255      1.629      0.103     -18.285     198.313
NEIGHBORHOOD_ID_platt-park                    12.4617     75.253      0.166      0.868    -135.032     159.955
NEIGHBORHOOD_ID_regis                          1.4148    142.988      0.010      0.992    -278.837     281.667
NEIGHBORHOOD_ID_rosedale                      16.6685     77.725      0.214      0.830    -135.669     169.006
NEIGHBORHOOD_ID_ruby-hill                     30.6764     55.222      0.556      0.579     -77.556     138.909
NEIGHBORHOOD_ID_skyland                       -2.8146    164.744     -0.017      0.986    -325.706     320.077
NEIGHBORHOOD_ID_sloan-lake                   -34.8450     52.955     -0.658      0.511    -138.635      68.945
NEIGHBORHOOD_ID_south-park-hill                8.3820     99.053      0.085      0.933    -185.759     202.523
NEIGHBORHOOD_ID_southmoor-park                -1.3967    235.356     -0.006      0.995    -462.685     459.892
NEIGHBORHOOD_ID_speer                          8.3227    100.834      0.083      0.934    -189.309     205.954
NEIGHBORHOOD_ID_stapleton                     -5.4132     95.266     -0.057      0.955    -192.132     181.305
NEIGHBORHOOD_ID_sun-valley                    23.8449     60.888      0.392      0.695     -95.493     143.183
NEIGHBORHOOD_ID_sunnyside                     14.0546     64.222      0.219      0.827    -111.818     139.927
NEIGHBORHOOD_ID_union-station                 -6.7590     92.627     -0.073      0.942    -188.304     174.786
NEIGHBORHOOD_ID_university                    -1.4219    142.596     -0.010      0.992    -280.905     278.061
NEIGHBORHOOD_ID_university-hills               9.7412     98.725      0.099      0.921    -183.757     203.240
NEIGHBORHOOD_ID_university-park               -4.2217    141.468     -0.030      0.976    -281.494     273.050
NEIGHBORHOOD_ID_valverde                      40.3794     51.985      0.777      0.437     -61.509     142.267
NEIGHBORHOOD_ID_villa-park                     2.8502     71.314      0.040      0.968    -136.923     142.623
NEIGHBORHOOD_ID_virginia-village               6.8877     76.812      0.090      0.929    -143.662     157.437
NEIGHBORHOOD_ID_washington-park               -0.0182    175.754     -0.000      1.000    -344.491     344.454
NEIGHBORHOOD_ID_washington-park-west          11.1075     89.651      0.124      0.901    -164.606     186.821
NEIGHBORHOOD_ID_washington-virginia-vale     -15.6217     56.147     -0.278      0.781    -125.669      94.425
NEIGHBORHOOD_ID_wellshire                      5.5668    125.165      0.044      0.965    -239.753     250.886
NEIGHBORHOOD_ID_west-colfax                   -9.7478     64.739     -0.151      0.880    -136.633     117.138
NEIGHBORHOOD_ID_west-highland                  8.4267    100.203      0.084      0.933    -187.968     204.822
NEIGHBORHOOD_ID_westwood                       9.7327     59.357      0.164      0.870    -106.606     126.071
NEIGHBORHOOD_ID_whittier                      -1.4078    232.181     -0.006      0.995    -456.474     453.659
NEIGHBORHOOD_ID_windsor                        6.9582    113.854      0.061      0.951    -216.192     230.108
DISTRICT_ID_2                                 79.2259     39.558      2.003      0.045       1.693     156.759
DISTRICT_ID_3                                128.4138     56.099      2.289      0.022      18.461     238.366
DISTRICT_ID_4                                 70.5228     43.128      1.635      0.102     -14.006     155.051
DISTRICT_ID_5                                -31.8336     94.929     -0.335      0.737    -217.891     154.224
DISTRICT_ID_6                               -158.0270     55.388     -2.853      0.004    -266.586     -49.468
DISTRICT_ID_7                                 -1.3521    334.152     -0.004      0.997    -656.278     653.574
lat                                           -8.8001    593.624     -0.015      0.988   -1172.282    1154.682
long                                          -3.2615    421.823     -0.008      0.994    -830.019     823.497
dist                                          -6.0827      5.816     -1.046      0.296     -17.482       5.316
==============================================================================================================
Train Confusion Matrix : 
 [[542 186]
 [285 443]] 

Train Accuracy :  0.676510989010989 

Train f1-score :  0.6529108327192336 

Train Recall :  0.6085164835164835 

Train Precision :  0.7042925278219396 

Test Confusion Matrix : 
 [[ 44  21]
 [ 89 146]]
Test Accuracy :  0.6333333333333333 

Test f1-score :  0.72636815920398 

Test Recall :  0.6212765957446809 

Test Precision :  0.874251497005988 
index coefficient p-value
79 DISTRICT_ID_6 -158.026969 0.004330
76 DISTRICT_ID_3 128.413803 0.022077
22 NEIGHBORHOOD_ID_elyria-swansea 91.392536 0.036442
75 DISTRICT_ID_2 79.225938 0.045202
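Shortlists like the one above (coefficients whose p-values fall below 0.05, ordered by significance) can be produced from any coefficient/p-value listing. A minimal pandas sketch, using values reported for the geographical model plus one insignificant row (`dist`) as stand-ins:

```python
import pandas as pd

# coefficient / p-value pairs copied from the model summary above
summary = pd.DataFrame({
    'index': ['DISTRICT_ID_6', 'DISTRICT_ID_3',
              'NEIGHBORHOOD_ID_elyria-swansea', 'DISTRICT_ID_2', 'dist'],
    'coefficient': [-158.026969, 128.413803, 91.392536, 79.225938, -6.0827],
    'p-value': [0.004330, 0.022077, 0.036442, 0.045202, 0.296],
})
# keep only coefficients significant at the 5% level, most significant first
significant = (summary[summary['p-value'] < 0.05]
               .sort_values('p-value')
               .reset_index(drop=True))
print(significant)
```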

…but the only satisfactory metric is precision; the rest of the metrics fall below an acceptable standard.

To help the model, we continue subsetting the predictors, focusing on the geographical ones but adding a few other variables on top of them.

Code
# geographical dummies plus crime duration and the violence indicator
geomixpredictors = pd.concat(objs=[geopredictors.iloc[:, 0:80], lrdf.duration, lrdf.VIOLENCE_RELATION_violent], axis=1)
lr_roc(predictors=geomixpredictors, target=lrdf.MJ_RELATION_TYPE)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.478963
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
The optimal threshold based on mean test and train f1-score is 0.46

fig33. The ROC-curve of the model based on geographical predictors, crime duration and violence relation

Code
lr_model_assessment(predictors=geomixpredictors, target=lrdf.MJ_RELATION_TYPE, threshold=0.46)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.478963
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
                           Logit Regression Results                           
==============================================================================
Dep. Variable:       MJ_RELATION_TYPE   No. Observations:                 1456
Model:                          Logit   Df Residuals:                     1373
Method:                           MLE   Df Model:                           82
Date:                Sun, 29 Jan 2023   Pseudo R-squ.:                  0.3090
Time:                        04:00:41   Log-Likelihood:                -697.37
converged:                      False   LL-Null:                       -1009.2
Covariance Type:            nonrobust   LLR p-value:                 2.950e-84
==============================================================================================================
                                                 coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------
const                                         -2.6270      0.476     -5.518      0.000      -3.560      -1.694
NEIGHBORHOOD_ID_auraria                       -2.5325      2.273     -1.114      0.265      -6.988       1.923
NEIGHBORHOOD_ID_baker                          0.0986      0.716      0.138      0.890      -1.304       1.501
NEIGHBORHOOD_ID_barnum                         1.1136      1.238      0.900      0.368      -1.312       3.539
NEIGHBORHOOD_ID_barnum-west                   -0.2691      1.683     -0.160      0.873      -3.567       3.029
NEIGHBORHOOD_ID_bear-valley                   -0.0176     97.845     -0.000      1.000    -191.790     191.755
NEIGHBORHOOD_ID_belcaro                        2.0616      1.452      1.419      0.156      -0.785       4.908
NEIGHBORHOOD_ID_berkeley                       0.3165      1.490      0.212      0.832      -2.604       3.237
NEIGHBORHOOD_ID_capitol-hill                   2.7883      1.406      1.984      0.047       0.033       5.543
NEIGHBORHOOD_ID_cbd                           -0.5216      1.346     -0.388      0.698      -3.159       2.116
NEIGHBORHOOD_ID_cheesman-park                 -0.1831      1.264     -0.145      0.885      -2.660       2.294
NEIGHBORHOOD_ID_cherry-creek                   0.0427      2.787      0.015      0.988      -5.420       5.506
NEIGHBORHOOD_ID_city-park                      0.3851      2.388      0.161      0.872      -4.294       5.065
NEIGHBORHOOD_ID_city-park-west                -0.0693      1.786     -0.039      0.969      -3.570       3.431
NEIGHBORHOOD_ID_civic-center                   0.5662      2.859      0.198      0.843      -5.037       6.169
NEIGHBORHOOD_ID_clayton                       -0.4637      2.286     -0.203      0.839      -4.944       4.017
NEIGHBORHOOD_ID_cole                           2.4068      0.971      2.479      0.013       0.504       4.310
NEIGHBORHOOD_ID_college-view-south-platte      1.5182      0.772      1.967      0.049       0.006       3.031
NEIGHBORHOOD_ID_congress-park                 -0.2954      4.304     -0.069      0.945      -8.732       8.141
NEIGHBORHOOD_ID_cory-merrill                   0.2319      4.031      0.058      0.954      -7.668       8.132
NEIGHBORHOOD_ID_dia                           -0.8864      2.486     -0.357      0.721      -5.760       3.987
NEIGHBORHOOD_ID_east-colfax                    1.2339      0.855      1.443      0.149      -0.442       2.909
NEIGHBORHOOD_ID_elyria-swansea                 7.5594      1.149      6.578      0.000       5.307       9.812
NEIGHBORHOOD_ID_five-points                    2.5883      0.756      3.423      0.001       1.106       4.070
NEIGHBORHOOD_ID_fort-logan                     0.2336      4.039      0.058      0.954      -7.683       8.150
NEIGHBORHOOD_ID_gateway-green-valley-ranch    -1.6849      2.916     -0.578      0.563      -7.401       4.031
NEIGHBORHOOD_ID_globeville                     5.7932      1.128      5.138      0.000       3.583       8.003
NEIGHBORHOOD_ID_goldsmith                      0.5595      1.851      0.302      0.762      -3.068       4.187
NEIGHBORHOOD_ID_hale                           1.4134      1.812      0.780      0.435      -2.138       4.965
NEIGHBORHOOD_ID_hampden                        1.8356      1.209      1.518      0.129      -0.535       4.206
NEIGHBORHOOD_ID_hampden-south                  0.2319      4.031      0.058      0.954      -7.668       8.132
NEIGHBORHOOD_ID_harvey-park                   -0.9627      2.391     -0.403      0.687      -5.648       3.723
NEIGHBORHOOD_ID_harvey-park-south             -0.3406      4.527     -0.075      0.940      -9.212       8.531
NEIGHBORHOOD_ID_highland                       2.2493      1.056      2.130      0.033       0.179       4.319
NEIGHBORHOOD_ID_indian-creek                   0.1274      2.361      0.054      0.957      -4.501       4.756
NEIGHBORHOOD_ID_jefferson-park                 1.2187      1.997      0.610      0.542      -2.695       5.132
NEIGHBORHOOD_ID_kennedy                       -0.8298      2.530     -0.328      0.743      -5.788       4.128
NEIGHBORHOOD_ID_lincoln-park                   3.1219      0.669      4.667      0.000       1.811       4.433
NEIGHBORHOOD_ID_lowry-field                   -0.3322      4.093     -0.081      0.935      -8.354       7.690
NEIGHBORHOOD_ID_mar-lee                        1.3860      0.920      1.506      0.132      -0.417       3.189
NEIGHBORHOOD_ID_marston                        0.8314      2.169      0.383      0.702      -3.420       5.083
NEIGHBORHOOD_ID_montbello                      2.9962      1.441      2.079      0.038       0.172       5.821
NEIGHBORHOOD_ID_montclair                      1.3038      1.586      0.822      0.411      -1.805       4.413
NEIGHBORHOOD_ID_north-capitol-hill            -2.0967      1.881     -1.115      0.265      -5.784       1.590
NEIGHBORHOOD_ID_north-park-hill               -0.9442      2.516     -0.375      0.707      -5.875       3.987
NEIGHBORHOOD_ID_northeast-park-hill            2.3238      0.726      3.200      0.001       0.900       3.747
NEIGHBORHOOD_ID_overland                       8.7556      2.053      4.266      0.000       4.733      12.779
NEIGHBORHOOD_ID_platt-park                     1.8418      1.117      1.649      0.099      -0.348       4.031
NEIGHBORHOOD_ID_regis                          0.5989      2.119      0.283      0.777      -3.554       4.751
NEIGHBORHOOD_ID_rosedale                       2.2639      1.251      1.809      0.070      -0.189       4.716
NEIGHBORHOOD_ID_ruby-hill                      3.7751      1.288      2.930      0.003       1.250       6.300
NEIGHBORHOOD_ID_skyland                       -0.4977      3.299     -0.151      0.880      -6.964       5.969
NEIGHBORHOOD_ID_sloan-lake                    -4.1796      2.334     -1.790      0.073      -8.755       0.396
NEIGHBORHOOD_ID_south-park-hill                1.7120      1.664      1.029      0.304      -1.550       4.974
NEIGHBORHOOD_ID_southmoor-park                -0.2744      3.954     -0.069      0.945      -8.025       7.476
NEIGHBORHOOD_ID_speer                          1.5681      1.410      1.112      0.266      -1.195       4.331
NEIGHBORHOOD_ID_stapleton                      0.7645      1.771      0.432      0.666      -2.706       4.235
NEIGHBORHOOD_ID_sun-valley                     4.9000      1.243      3.941      0.000       2.463       7.337
NEIGHBORHOOD_ID_sunnyside                      3.4666      1.001      3.463      0.001       1.505       5.428
NEIGHBORHOOD_ID_union-station                  1.4416      1.402      1.029      0.304      -1.305       4.189
NEIGHBORHOOD_ID_university                    -0.8589      1.957     -0.439      0.661      -4.695       2.977
NEIGHBORHOOD_ID_university-hills               1.7288      1.566      1.104      0.269      -1.340       4.797
NEIGHBORHOOD_ID_university-park               -0.9661      2.489     -0.388      0.698      -5.844       3.911
NEIGHBORHOOD_ID_valverde                       4.8461      1.290      3.757      0.000       2.318       7.374
NEIGHBORHOOD_ID_villa-park                     0.1115      1.118      0.100      0.921      -2.080       2.303
NEIGHBORHOOD_ID_virginia-village               0.8665      1.181      0.734      0.463      -1.448       3.181
NEIGHBORHOOD_ID_washington-park               -0.0428      2.804     -0.015      0.988      -5.539       5.454
NEIGHBORHOOD_ID_washington-park-west           2.2758      1.425      1.598      0.110      -0.516       5.068
NEIGHBORHOOD_ID_washington-virginia-vale      -1.5715      0.938     -1.676      0.094      -3.409       0.266
NEIGHBORHOOD_ID_wellshire                      1.0378      1.993      0.521      0.603      -2.869       4.944
NEIGHBORHOOD_ID_west-colfax                   -0.7379      1.284     -0.575      0.566      -3.255       1.779
NEIGHBORHOOD_ID_west-highland                  2.4844      1.361      1.825      0.068      -0.184       5.152
NEIGHBORHOOD_ID_westwood                       1.3613      1.032      1.320      0.187      -0.661       3.383
NEIGHBORHOOD_ID_whittier                      -0.2954      4.304     -0.069      0.945      -8.732       8.141
NEIGHBORHOOD_ID_windsor                        1.0911      1.875      0.582      0.561      -2.585       4.767
DISTRICT_ID_2                                  0.3442      0.649      0.530      0.596      -0.928       1.616
DISTRICT_ID_3                                  1.1333      0.637      1.779      0.075      -0.115       2.381
DISTRICT_ID_4                                  1.6840      0.585      2.880      0.004       0.538       2.830
DISTRICT_ID_5                                 -1.3275      1.411     -0.941      0.347      -4.094       1.439
DISTRICT_ID_6                                 -1.4033      0.918     -1.529      0.126      -3.202       0.396
DISTRICT_ID_7                                 -0.1809      6.156     -0.029      0.977     -12.246      11.884
duration                                       0.6533      0.325      2.013      0.044       0.017       1.289
VIOLENCE_RELATION_violent                      1.0998      0.247      4.453      0.000       0.616       1.584
==============================================================================================================
Train Confusion Matrix : 
 [[548 180]
 [120 608]] 

Train Accuracy :  0.7939560439560439 

Train f1-score :  0.8021108179419525 

Train Recall :  0.8351648351648352 

Train Precision :  0.7715736040609137 

Test Confusion Matrix : 
 [[ 33  32]
 [ 37 198]]
Test Accuracy :  0.77 

Test f1-score :  0.8516129032258065 

Test Recall :  0.8425531914893617 

Test Precision :  0.8608695652173913 
index coefficient p-value
22 NEIGHBORHOOD_ID_elyria-swansea 7.559398 4.766545e-11
0 const -2.626953 3.432056e-08
26 NEIGHBORHOOD_ID_globeville 5.793166 2.777628e-07
37 NEIGHBORHOOD_ID_lincoln-park 3.121932 3.054240e-06
82 VIOLENCE_RELATION_violent 1.099824 8.451458e-06
46 NEIGHBORHOOD_ID_overland 8.755627 1.993029e-05
57 NEIGHBORHOOD_ID_sun-valley 4.900007 8.099214e-05
63 NEIGHBORHOOD_ID_valverde 4.846095 1.719529e-04
58 NEIGHBORHOOD_ID_sunnyside 3.466593 5.333654e-04
23 NEIGHBORHOOD_ID_five-points 2.588267 6.183452e-04
45 NEIGHBORHOOD_ID_northeast-park-hill 2.323757 1.375104e-03
50 NEIGHBORHOOD_ID_ruby-hill 3.775062 3.391322e-03
77 DISTRICT_ID_4 1.683975 3.980580e-03
16 NEIGHBORHOOD_ID_cole 2.406841 1.318341e-02
33 NEIGHBORHOOD_ID_highland 2.249348 3.319999e-02
41 NEIGHBORHOOD_ID_montbello 2.996203 3.761172e-02
81 duration 0.653340 4.409515e-02
8 NEIGHBORHOOD_ID_capitol-hill 2.788294 4.730744e-02
17 NEIGHBORHOOD_ID_college-view-south-platte 1.518170 4.916711e-02

Thus the model does manage to give good predictions.
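The sorted coefficient/p-value listing above can be produced from a fitted statsmodels result, whose `params` and `pvalues` attributes are pandas Series. A minimal sketch: `significant_coefficients` is a hypothetical helper (not from the report's code), shown here with toy stand-in Series borrowing a few values from the table above.

```python
import pandas as pd

def significant_coefficients(params, pvalues, alpha=0.05):
    """Pair coefficients with p-values, keep those below alpha,
    and sort by p-value ascending (the layout used in the report)."""
    table = pd.DataFrame({"coefficient": params, "p-value": pvalues})
    return table[table["p-value"] < alpha].sort_values("p-value")

# Toy stand-ins for result.params / result.pvalues of a statsmodels Logit fit
params = pd.Series({"const": -2.63,
                    "NEIGHBORHOOD_ID_elyria-swansea": 7.56,
                    "NEIGHBORHOOD_ID_cbd": -0.52})
pvalues = pd.Series({"const": 3.4e-08,
                     "NEIGHBORHOOD_ID_elyria-swansea": 4.8e-11,
                     "NEIGHBORHOOD_ID_cbd": 0.698})

print(significant_coefficients(params, pvalues))
```

With a real fit, the same call would be `significant_coefficients(result.params, result.pvalues)`.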

Machine learning focused on other predictors

We have already built both random forest and logistic regression models using all the variables and then focusing on geo-related variables. Now we focus on time, violence relation and offense categories and follow the same sequence.
Code
catpredictors = pd.concat(objs=[modeldf.iloc[:, 4], modeldf.iloc[:, 11:25], modeldf.iloc[:, 99:]], axis=1)
rf_model_assessment(target=modeldf.MJ_RELATION_TYPE, predictors=catpredictors)

Train Confusion Matrix : 
 [[678  42]
 [ 36 684]] 

Train Accuracy :  0.9458333333333333 

Train f1-score :  0.946058091286307 

Train Recall :  0.95 

Train Precision :  0.9421487603305785 

Test Confusion Matrix : 
 [[ 32  25]
 [ 29 214]]
Test Accuracy :  0.82 

Test f1-score :  0.8879668049792531 

Test Recall :  0.8806584362139918 

Test Precision :  0.895397489539749 

fig34. Although we have plenty of predictors, the most important turned out to be two of the three most common offense categories and the violence flag. Time-related predictors did not carry much importance.

The above barplot suggests dropping the time-related predictors.
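A side note on how these predictor subsets are built: the `iloc` slices used throughout depend on column positions after dummification, which silently break if columns are reordered. A name-based alternative (a sketch on a toy frame; the column names are illustrative, assuming the dummy columns keep prefixes like `OFFENSE_CATEGORY_ID_`) uses `DataFrame.filter` with a regex:

```python
import pandas as pd

# Toy frame mimicking a dummified modeldf (column names are illustrative)
df = pd.DataFrame({
    "duration": [1.0, 2.0],
    "OFFENSE_CATEGORY_ID_Burglary": [1, 0],
    "Month_2": [0, 1],
    "Day Name_Monday": [1, 0],
    "VIOLENCE_RELATION_violent": [0, 1],
})

# Select predictor groups by name prefix instead of positional iloc slices
catpredictors_byname = pd.concat(
    [df[["duration"]],
     df.filter(regex=r"^(OFFENSE_CATEGORY_ID_|Month_|Day Name_|VIOLENCE_RELATION_)")],
    axis=1)
print(list(catpredictors_byname.columns))
```

Dropping a group (e.g. the time dummies) then becomes a matter of removing its prefix from the regex rather than recomputing column positions.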
Code
catpredictorsnotime = pd.concat(objs=[modeldf.iloc[:, 11:25], modeldf.iloc[:, 116]], axis=1)
rf_model_assessment(target=modeldf.MJ_RELATION_TYPE, predictors=catpredictorsnotime)

Train Confusion Matrix : 
 [[486 234]
 [ 73 647]] 

Train Accuracy :  0.7868055555555555 

Train f1-score :  0.8082448469706435 

Train Recall :  0.8986111111111111 

Train Precision :  0.7343927355278093 

Test Confusion Matrix : 
 [[ 31  26]
 [ 20 223]]
Test Accuracy :  0.8466666666666667 

Test f1-score :  0.9065040650406505 

Test Recall :  0.9176954732510288 

Test Precision :  0.8955823293172691 

fig35. This model does not seem to be better. Note especially the poor performance on the training set (which, in contrast to the test set, has an equal number of industrial and non-industrial observations). So time does contribute substantially to the model's answer.

However, refocusing the model on time-related predictors only does not give the desired result either.
Code
rf_model_assessment(target=modeldf.MJ_RELATION_TYPE, predictors=catpredictors.drop(labels=
['OFFENSE_CATEGORY_ID_Burglary','OFFENSE_CATEGORY_ID_Robbery-Street-Res', 'VIOLENCE_RELATION_violent'], axis=1))

Train Confusion Matrix : 
 [[616 104]
 [133 587]] 

Train Accuracy :  0.8354166666666667 

Train f1-score :  0.8320340184266478 

Train Recall :  0.8152777777777778 

Train Precision :  0.849493487698987 

Test Confusion Matrix : 
 [[ 15  42]
 [ 61 182]]
Test Accuracy :  0.6566666666666666 

Test f1-score :  0.7794432548179873 

Test Recall :  0.7489711934156379 

Test Precision :  0.8125 

fig36. Admittedly, we have not dropped all of the predictors that are not time-related, but we excluded the three most important of them. The model did not collapse, but its performance is far from good.

This is why we should tune the initial geo-free model that includes both time-related and category-related predictors.
Code
# Sweep the number of trees from 1 to 100, recording the f1-score for each
leest1 = []  # f1-scores
leest2 = []  # parameter values tried
for i in range(1, 101):
    leest1.append(rf_model_assessment(predictors=catpredictors,
    target=modeldf.MJ_RELATION_TYPE, plot=False, nestimators=i,
    output1=False, output2=True))
    leest2.append(i)
ax = sns.lineplot(x=leest2, y=leest1)
ax.set_xticks(ticks=list(range(0, 110, 10)))
ax.set_xlabel(xlabel='Number of trees')
ax.set_ylabel(ylabel='F1-score')
nestimators = leest2[leest1.index(max(leest1))]  # best number of trees
plt.show()

# Sweep the maximum tree depth from 1 to 49
leest1 = []
leest2 = []
for i in range(1, 50):
    leest1.append(rf_model_assessment(predictors=catpredictors,
    target=modeldf.MJ_RELATION_TYPE, plot=False, maxdepth=i,
    output1=False, output2=True))
    leest2.append(i)
ax = sns.lineplot(x=leest2, y=leest1)
ax.set_xticks(ticks=list(range(0, 55, 5)))
ax.set_xlabel(xlabel='Maximum tree depth')
ax.set_ylabel(ylabel='F1-score')
maxdepth = leest2[leest1.index(max(leest1))]  # best depth
plt.show()

# Sweep the maximum number of features per split, in steps of 5
leest1 = []
leest2 = []
for i in range(1, 50, 5):
    leest1.append(rf_model_assessment(predictors=catpredictors,
    target=modeldf.MJ_RELATION_TYPE, plot=False,
    output1=False, output2=True, maxfeatures=i))
    leest2.append(i)
ax = sns.lineplot(x=leest2, y=leest1)
ax.set_xticks(ticks=list(range(0, 55, 5)))
ax.set_xlabel(xlabel='Maximum features in tree')
ax.set_ylabel(ylabel='F1-score')
maxfeatures = leest2[leest1.index(max(leest1))]  # best max_features
plt.show()

figs. 37-39. As with the earlier plots of this kind, the trend is chaotic, but we can make use of the peaks recorded by the algorithm.
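The three sweeps above tune each hyperparameter independently. An alternative worth noting is scikit-learn's `GridSearchCV`, which searches all combinations jointly with cross-validation. A minimal sketch on synthetic data (the real `catpredictors` and target are swapped for `make_classification` output; the grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for catpredictors / MJ_RELATION_TYPE
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "n_estimators": [10, 50, 100],   # number of trees
    "max_depth": [3, 10, None],      # maximum tree depth
    "max_features": ["sqrt", None],  # features considered per split
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The joint search is more expensive than three one-dimensional sweeps but avoids missing interactions between the parameters.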

Code
rf_model_assessment(target=modeldf.MJ_RELATION_TYPE, 
predictors=catpredictors,
nestimators=nestimators, maxdepth=maxdepth, maxfeatures=maxfeatures)

Train Confusion Matrix : 
 [[607 113]
 [ 35 685]] 

Train Accuracy :  0.8972222222222223 

Train f1-score :  0.9025032938076416 

Train Recall :  0.9513888888888888 

Train Precision :  0.8583959899749374 

Test Confusion Matrix : 
 [[ 31  26]
 [ 17 226]]
Test Accuracy :  0.8566666666666667 

Test f1-score :  0.9131313131313132 

Test Recall :  0.9300411522633745 

Test Precision :  0.8968253968253969 

fig40. The feature importance barplot looks familiar, despite some changes in ranking.

The f1-score is over 0.91 again, so the model may be deemed capable enough.

In the same fashion, we apply logistic regression here.

Code
lr_roc(predictors=catpredictors, target=lrdf.MJ_RELATION_TYPE)
Warning: Maximum number of iterations has been exceeded.

         Current function value: 0.372277
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
The optimal threshold based on mean test and train f1-score is 0.33

fig41. The ROC curve of the model including all non-geo predictors.
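The "optimal threshold" reported above is presumably found by scanning candidate cutoffs on the predicted probabilities and maximizing f1 (the report uses the mean of train and test f1; the sketch below simplifies to a single set). `best_threshold` is a hypothetical helper, not the report's `lr_roc` internals:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, proba, grid=np.arange(0.05, 1.0, 0.01)):
    """Return the probability cutoff that maximizes the f1-score."""
    scores = [f1_score(y_true, (proba >= t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))]

# Toy example: the classes separate around a probability of ~0.4
y = np.array([0, 0, 0, 1, 1, 1])
p = np.array([0.1, 0.2, 0.35, 0.45, 0.7, 0.9])
print(round(best_threshold(y, p), 2))
```

This is why the chosen threshold (0.33 here) can differ from the default 0.5: with imbalanced classes, a lower cutoff trades precision for recall to maximize f1.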

Code
lr_model_assessment(predictors=catpredictors, target=lrdf.MJ_RELATION_TYPE, threshold =0.33)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.372277
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
                           Logit Regression Results                           
==============================================================================
Dep. Variable:       MJ_RELATION_TYPE   No. Observations:                 1456
Model:                          Logit   Df Residuals:                     1422
Method:                           MLE   Df Model:                           33
Date:                Sun, 29 Jan 2023   Pseudo R-squ.:                  0.4629
Time:                        04:02:22   Log-Likelihood:                -542.04
converged:                      False   LL-Null:                       -1009.2
Covariance Type:            nonrobust   LLR p-value:                6.026e-175
==================================================================================================================
                                                     coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------
const                                             -6.4871      0.581    -11.161      0.000      -7.626      -5.348
duration                                           0.4529      0.329      1.376      0.169      -0.192       1.098
OFFENSE_CATEGORY_ID_Agg ASLT-Other                -0.6824      1.210     -0.564      0.573      -3.054       1.689
OFFENSE_CATEGORY_ID_All Other Crimes               4.2786      0.690      6.197      0.000       2.925       5.632
OFFENSE_CATEGORY_ID_Auto Theft                     0.4945      1.624      0.304      0.761      -2.689       3.678
OFFENSE_CATEGORY_ID_Burglary                       3.5825      0.473      7.572      0.000       2.655       4.510
OFFENSE_CATEGORY_ID_Criminal Mischief-Graffiti     0.6512      2.705      0.241      0.810      -4.650       5.953
OFFENSE_CATEGORY_ID_Criminal Mischief-Property     6.9129      0.991      6.973      0.000       4.970       8.856
OFFENSE_CATEGORY_ID_Criminal Mischief-Vehicle      0.4864      3.162      0.154      0.878      -5.712       6.684
OFFENSE_CATEGORY_ID_Drug Offenses                  2.8562      0.783      3.646      0.000       1.321       4.391
OFFENSE_CATEGORY_ID_Larceny                        3.9144      0.598      6.549      0.000       2.743       5.086
OFFENSE_CATEGORY_ID_Robbery-Business               1.9064      1.505      1.266      0.205      -1.044       4.857
OFFENSE_CATEGORY_ID_Robbery-Street-Res            -9.2780      1.329     -6.981      0.000     -11.883      -6.673
OFFENSE_CATEGORY_ID_Simple Assault                -1.5824      0.985     -1.607      0.108      -3.512       0.347
OFFENSE_CATEGORY_ID_Theft from Motor Vehicle       1.4019      1.068      1.312      0.189      -0.692       3.496
OFFENSE_CATEGORY_ID_Weapons Offense                0.6212      2.986      0.208      0.835      -5.231       6.474
Month_2                                            4.1369      0.771      5.365      0.000       2.626       5.648
Month_3                                            1.9532      0.722      2.707      0.007       0.539       3.368
Month_4                                            4.4303      0.841      5.270      0.000       2.783       6.078
Month_5                                            2.8148      0.638      4.415      0.000       1.565       4.064
Month_6                                            1.0743      0.577      1.863      0.062      -0.056       2.205
Month_7                                            1.0172      0.583      1.745      0.081      -0.126       2.160
Month_8                                            3.8096      0.685      5.562      0.000       2.467       5.152
Month_9                                            3.5665      0.579      6.162      0.000       2.432       4.701
Month_10                                           1.7855      0.634      2.818      0.005       0.543       3.028
Month_11                                           2.3611      0.682      3.462      0.001       1.024       3.698
Month_12                                           1.7361      0.664      2.616      0.009       0.435       3.037
Day Name_Monday                                    0.9693      0.483      2.006      0.045       0.022       1.916
Day Name_Saturday                                  1.3225      0.542      2.438      0.015       0.260       2.386
Day Name_Sunday                                    2.5113      0.673      3.730      0.000       1.192       3.831
Day Name_Thursday                                  1.1426      0.483      2.364      0.018       0.195       2.090
Day Name_Tuesday                                   1.1038      0.492      2.244      0.025       0.140       2.068
Day Name_Wednesday                                 1.3614      0.501      2.718      0.007       0.380       2.343
VIOLENCE_RELATION_violent                          3.4725      0.405      8.570      0.000       2.678       4.267
==================================================================================================================
Train Confusion Matrix : 
 [[520 208]
 [ 37 691]] 

Train Accuracy :  0.8317307692307693 

Train f1-score :  0.8494161032575291 

Train Recall :  0.9491758241758241 

Train Precision :  0.7686318131256952 

Test Confusion Matrix : 
 [[ 18  47]
 [ 13 222]]
Test Accuracy :  0.8 

Test f1-score :  0.8809523809523809 

Test Recall :  0.9446808510638298 

Test Precision :  0.8252788104089219 
index coefficient p-value
0 const -6.487139 6.349380e-29
33 VIOLENCE_RELATION_violent 3.472524 1.037485e-17
5 OFFENSE_CATEGORY_ID_Burglary 3.582486 3.689103e-14
12 OFFENSE_CATEGORY_ID_Robbery-Street-Res -9.278028 2.926936e-12
7 OFFENSE_CATEGORY_ID_Criminal Mischief-Property 6.912936 3.109326e-12
10 OFFENSE_CATEGORY_ID_Larceny 3.914438 5.788362e-11
3 OFFENSE_CATEGORY_ID_All Other Crimes 4.278566 5.769969e-10
23 Month_9 3.566527 7.192885e-10
22 Month_8 3.809604 2.673988e-08
16 Month_2 4.136917 8.093056e-08
18 Month_4 4.430288 1.363615e-07
19 Month_5 2.814802 1.009933e-05
29 Day Name_Sunday 2.511314 1.911298e-04
9 OFFENSE_CATEGORY_ID_Drug Offenses 2.856189 2.660082e-04
25 Month_11 2.361072 5.363265e-04
24 Month_10 1.785516 4.839572e-03
32 Day Name_Wednesday 1.361401 6.559605e-03
17 Month_3 1.953236 6.798976e-03
26 Month_12 1.736057 8.901725e-03
28 Day Name_Saturday 1.322534 1.475099e-02
30 Day Name_Thursday 1.142633 1.809819e-02
31 Day Name_Tuesday 1.103771 2.483787e-02
27 Day Name_Monday 0.969302 4.483122e-02

The model shows good performance. Let us try removing the time-related predictors to compare the effect with the one we observed for the random forest.

Code
lr_roc(predictors=catpredictorsnotime, target=lrdf.MJ_RELATION_TYPE)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.475193
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
The optimal threshold based on mean test and train f1-score is 0.4

fig42. The ROC curve of the model including all non-geo predictors except for time-related ones.

Code
lr_model_assessment(predictors=catpredictorsnotime, target=lrdf.MJ_RELATION_TYPE, threshold=0.4)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.475193
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
                           Logit Regression Results                           
==============================================================================
Dep. Variable:       MJ_RELATION_TYPE   No. Observations:                 1456
Model:                          Logit   Df Residuals:                     1440
Method:                           MLE   Df Model:                           15
Date:                Sun, 29 Jan 2023   Pseudo R-squ.:                  0.3144
Time:                        04:02:26   Log-Likelihood:                -691.88
converged:                      False   LL-Null:                       -1009.2
Covariance Type:            nonrobust   LLR p-value:                1.504e-125
==================================================================================================================
                                                     coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------
const                                             -2.3700      0.552     -4.296      0.000      -3.451      -1.289
OFFENSE_CATEGORY_ID_Agg ASLT-Other                -2.9291      0.914     -3.204      0.001      -4.721      -1.137
OFFENSE_CATEGORY_ID_All Other Crimes               1.7354      0.634      2.738      0.006       0.493       2.978
OFFENSE_CATEGORY_ID_Auto Theft                     0.5716      1.068      0.535      0.592      -1.521       2.664
OFFENSE_CATEGORY_ID_Burglary                       1.7664      0.551      3.205      0.001       0.686       2.846
OFFENSE_CATEGORY_ID_Criminal Mischief-Graffiti     1.0281      1.902      0.541      0.589      -2.700       4.756
OFFENSE_CATEGORY_ID_Criminal Mischief-Property     5.1203      0.798      6.418      0.000       3.557       6.684
OFFENSE_CATEGORY_ID_Criminal Mischief-Vehicle      0.5594      2.055      0.272      0.785      -3.468       4.586
OFFENSE_CATEGORY_ID_Drug Offenses                  1.7109      0.659      2.596      0.009       0.419       3.003
OFFENSE_CATEGORY_ID_Larceny                        2.1370      0.595      3.589      0.000       0.970       3.304
OFFENSE_CATEGORY_ID_Robbery-Business               1.7576      1.344      1.308      0.191      -0.877       4.392
OFFENSE_CATEGORY_ID_Robbery-Street-Res            -6.8911      0.859     -8.026      0.000      -8.574      -5.208
OFFENSE_CATEGORY_ID_Simple Assault                -3.3570      0.839     -4.001      0.000      -5.001      -1.713
OFFENSE_CATEGORY_ID_Theft from Motor Vehicle       0.4023      0.760      0.530      0.596      -1.086       1.891
OFFENSE_CATEGORY_ID_Weapons Offense                0.8894      1.704      0.522      0.602      -2.451       4.230
VIOLENCE_RELATION_violent                          3.1449      0.305     10.299      0.000       2.546       3.743
==================================================================================================================
Train Confusion Matrix : 
 [[522 206]
 [128 600]] 

Train Accuracy :  0.7706043956043956 

Train f1-score :  0.7822685788787483 

Train Recall :  0.8241758241758241 

Train Precision :  0.7444168734491315 

Test Confusion Matrix : 
 [[ 37  28]
 [ 44 191]]
Test Accuracy :  0.76 

Test f1-score :  0.8414096916299559 

Test Recall :  0.8127659574468085 

Test Precision :  0.8721461187214612 
index coefficient p-value
15 VIOLENCE_RELATION_violent 3.144882 7.107352e-25
11 OFFENSE_CATEGORY_ID_Robbery-Street-Res -6.891107 1.010133e-15
6 OFFENSE_CATEGORY_ID_Criminal Mischief-Property 5.120320 1.385133e-10
0 const -2.370015 1.742351e-05
12 OFFENSE_CATEGORY_ID_Simple Assault -3.356978 6.306820e-05
9 OFFENSE_CATEGORY_ID_Larceny 2.136956 3.313968e-04
4 OFFENSE_CATEGORY_ID_Burglary 1.766365 1.349817e-03
1 OFFENSE_CATEGORY_ID_Agg ASLT-Other -2.929091 1.353612e-03
2 OFFENSE_CATEGORY_ID_All Other Crimes 1.735421 6.186186e-03
8 OFFENSE_CATEGORY_ID_Drug Offenses 1.710910 9.426407e-03

This model is substantially worse than the very first one. Thus, the best logistic regression model that does not take geographical data into account should include both time-related and category-related variables.

Conclusion

Conclusion on q1:

  1. From visualization: the overall map of offences is not distributed uniformly. For example, the south-east of the city seems to be less crime-prone. However, for the rest of the crime classifications the map looks mostly uniform (all the breakdowns: by violence, by offense category, by MJ relation appear mixed up, without any attraction centers). To sum up, these questions cannot be answered by means of mere visualization.
  2. From correlation:
    1. Positive correlations: between business robberies and the Hale neighbourhood, aggravated assault and the 6th district, theft from motor vehicle and the 7th district, and the Speer neighbourhood and graffiti offences. This means these places are more vulnerable to these types of crimes. There is also a noticeable correlation between the distance to the city center and the probability of burglary.
    2. Negative correlations: no significant negative correlation with any geographical variable (be it DISTRICT_ID or NEIGHBORHOOD_ID) is present, with one exception: a negative correlation between burglary and the 6th district (meaning burglaries are unlikely there).

Conclusion on q2:

  1. From visualization:
    1. The crimes presented in the dataset are more often violent than not, and more often connected with industrial MJ objects than not. This conclusion is largely driven by the predominance of burglaries, since burglary is both violent and industrial. The share of more aggravated crimes, including those involving violence against people, is below 10%.
    2. However, this does not mean that MJ makes people more violent. It would be more accurate to say that after MJ was legalized, places of its high concentration (industrial MJ-growing sites) appeared. A person who wants to obtain a lot of MJ is most likely to burglarize such a site. Moreover, classifying burglaries as violent crimes is largely a convention and does not match the common meaning of the word 'violence'. Thus the predominance of violent crimes may be explained not by any violent disposition of the people of Denver, nor by the impact of MJ on mental health, but by the way the MJ industry is organized and by criminological convention. The main outcome of this observation is the emphasized necessity to secure MJ industrial sites as the most likely crime locale.
    3. One more notable fact is that few crimes in the dataset involve no crime object other than drugs. In other words, after MJ was legalized, not many purely drug-related crimes connected with it were registered. This suggests that MJ consumers are relatively unconcerned with other hazardous substances.
    4. Finally, we should note that the majority of crimes treat MJ as property (burglaries, larcenies etc.). Therefore the typical MJ criminal of Denver is not a deep-rooted drug addict but a person wishing to get their hands on MJ just as on any other asset. Hence MJ crimes should be combatted mainly by means intended for combatting crimes against property rather than by countermeasures against illegal drug circulation. For instance, reducing poverty and social inequality would be more useful than increasing MJ-specific control measures.
  2. The correlation part of the report is not particularly valuable for answering the second question.
  3. The machine learning part of the report reveals the high importance of information about the crime type and the time it was committed for resolving the posed machine learning problem. However, no model in which the crime type and its violence relation were the only predictors achieved sufficient accuracy and f1-score. Consequently, a purely geographical portrait of the crime is more valuable than a purely qualitative one. Deep analysis of crime-type-related random forest splits, or of crime-type dummies' coefficients in logistic regression, would be misleading due to the extreme skewness of the crime-type distribution (e.g. a 'Burglary or not' node will be very informative for predicting not only 'MJ_RELATION_TYPE' but many other target variables as well, which merely reflects the number of burglaries in the dataset and their criminological attributes rather than actual features of the MJ-crime population).

Conclusion on q3:

The posed machine learning problem was resolved by means of random forest and logistic regression. Both types of models proved themselves to be suitable to predict the target value of MJ_RELATION_TYPE.

The main limitation of all the machine learning performed (as well as of the machine learning interpretation performed for qq1,2) was the focus on the MJ_RELATION_TYPE variable. In other words, some insights might be obtained from machine learning aimed at other target values. However, within the framework of this project such a full analysis was infeasible, and the importance and statistical significance of predictors were assessed only for predicting the industrial or non-industrial nature of the offense.

Nevertheless, the nature of the task and of the domain the research was conducted in requires the model to be interpretable. As the random forest models' optimal number of trees was over 20 in all cases, it was impossible to visualize the trees, so the only visualization means was the feature importance ranking. However, the ranking itself is not informative in this case; we are more interested in the split thresholds, which we cannot visualize due to the number of trees. This is why the logistic regression models fit the task better.

The main feature of the logistic regression analysis was the constant need to limit the number of predictors. Due to dummification of categorical variables there were over 100 predictors, so the initial logistic regression model malfunctioned: only a few p-values were low enough to explore further, and the coefficients could also be misleading. After predictor subsetting was performed, the logistic regression model improved its performance and yielded many significant, interpretable coefficients. For instance, the burglary dummy variable had a strongly positive coefficient (being a burglary adds to the chances of the offense being industrial), while the street robbery dummy coefficient was one of the most negative (as no street robbery is industrial).
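The coefficient interpretation can be made concrete: logit coefficients are changes in log-odds, so exponentiating them gives odds multipliers. Using the two coefficients just mentioned, taken from the full non-geo model's table above:

```python
import math

# Logit coefficients are changes in log-odds; exp(coef) is the odds multiplier.
burglary = 3.5825         # OFFENSE_CATEGORY_ID_Burglary (full non-geo model)
street_robbery = -9.2780  # OFFENSE_CATEGORY_ID_Robbery-Street-Res

print(f"Burglary multiplies the odds of 'industrial' by ~{math.exp(burglary):.0f}")
print(f"Street robbery multiplies them by ~{math.exp(street_robbery):.5f}")
```

So, holding the other predictors fixed, a burglary raises the odds of the industrial label by roughly a factor of 36, while a street robbery shrinks them to practically zero, which matches the verbal interpretation above.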

However, machine learning was not very insightful in terms of new recommendations for the police beyond those highlighted for questions 1 and 2. The general output of the models designed is that a crime is a very multifaceted phenomenon, and its features cannot be predicted with sufficient accuracy (in the common meaning of this term) from predictors reflecting only the time the crime was committed, only its location, or only its type. The more predictors related to different crime aspects are at hand, the more successful the model is. In this respect we may recommend that the police department of Denver broaden the data they collect. For example, adding the number of offenders involved, or at least a flag for a crime committed by a group or organized gang, would be very beneficial.